/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:34:09 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:34:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:34:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/qqp to /home/aiscuser/.cache/huggingface/datasets/glue/qqp/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=3000,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50/runs/Jul19_14-34-10_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50', pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.65, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/QQP', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.0001, distill_temp=2.0, use_mac_l0=True, prune_location=[3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20)
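Note on the schedule: the target_sparsity values in the evaluation blocks below ramp linearly toward the configured target_sparsity=0.65 over lagrangian_warmup_epochs=10. A minimal sketch of that schedule (the function name and step counts are illustrative, not this repo's API):

def warmup_target_sparsity(step: int, warmup_steps: int, final_sparsity: float = 0.65) -> float:
    """Linearly anneal the sparsity target from 0 to final_sparsity over warmup_steps."""
    return final_sparsity * min(1.0, step / warmup_steps)

# QQP has ~364k train pairs, so batch size 32 gives ~11,371 steps/epoch and a
# 10-epoch warmup of ~113,710 steps. That reproduces the logged targets:
print(round(warmup_target_sparsity(3000, 113710), 4))  # ~0.0171, as at step 3000
print(round(warmup_target_sparsity(6000, 113710), 4))  # ~0.0343, as at step 6000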
----------------------------------------------------------------------
time: 2023-07-19 14:41:28
Evaluating: accuracy: 0.912, eval_loss: 0.4066, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! temperature: 0.67, init drop rate: 0.01
token_loga shape: [9, 50]
prune location: [3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.026014, lagrangian_loss: -0.002583, attention_score_distillation_loss: 0.000970
----------------------------------------------------------------------
time: 2023-07-19 14:55:50
Evaluating: accuracy: 0.9052, eval_loss: 0.4611, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.766, target_sparsity: 0.0171, step: 3000
lambda_1: 0.7990, lambda_2: 36.6934 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.99 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
loss: 0.051543, lagrangian_loss: -0.004071, attention_score_distillation_loss: 0.000949
loss: 0.020752, lagrangian_loss: 0.005120, attention_score_distillation_loss: 0.000937
----------------------------------------------------------------------
time: 2023-07-19 15:10:12
Evaluating: accuracy: 0.903, eval_loss: 0.4714, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0278, expected_sparsity: 0.0241, expected_sequence_sparsity: 0.7716, target_sparsity: 0.0343, step: 6000
lambda_1: -2.4793, lambda_2: 48.1642 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.99 1. 0.83]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111101111101110111100100
loss: 0.108954, lagrangian_loss: 0.001946, attention_score_distillation_loss: 0.000925
loss: 0.363066, lagrangian_loss: -0.000205, attention_score_distillation_loss: 0.000915
----------------------------------------------------------------------
time: 2023-07-19 15:24:35
Evaluating: accuracy: 0.904, eval_loss: 0.415, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.037, expected_sparsity: 0.0348, expected_sequence_sparsity: 0.7742, target_sparsity: 0.0514, step: 9000
lambda_1: 0.5591, lambda_2: 56.8841 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.98 0.98 0.74]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
00101111111111111111111110111101111101100111000100
loss: 0.146176, lagrangian_loss: -0.000136, attention_score_distillation_loss: 0.000914
loss: 0.598854, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000880
ETA: 1 day, 9:37:52 | Epoch 0 finished. Took 3104.41 seconds.
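Background on the gates: the "Starting l0 regularization!" banner above refers to a hard-concrete-style L0 relaxation, with one learnable log-alpha per (pruned layer, position bin), i.e. the [9, 50] token_loga tensor, sampled stochastically during training and thresholded deterministically at eval time. A sketch of such a deterministic eval-time gate under the usual stretch constants from Louizos et al. (assumed here, not extracted from this repo):

import torch

GAMMA, ZETA = -0.1, 1.1  # stretch interval of the hard concrete distribution (assumed)

def eval_gates(token_loga: torch.Tensor) -> torch.Tensor:
    # Deterministic gate: stretch sigmoid(log_alpha) to [GAMMA, ZETA], clamp to [0, 1].
    return (torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

token_loga = torch.zeros(9, 50)        # [pruned layers 3..11, bin_num=50]
keep_mask = (eval_gates(token_loga) > 0).float()
infer_remain = keep_mask.mean(dim=1)   # per-layer keep fraction, cf. "infer remain" below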
----------------------------------------------------------------------
time: 2023-07-19 15:38:59
Evaluating: accuracy: 0.9067, eval_loss: 0.4672, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0723, expected_sparsity: 0.0676, expected_sequence_sparsity: 0.7819, target_sparsity: 0.0686, step: 12000
lambda_1: 0.6324, lambda_2: 74.4152 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.96 0.95 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.6]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101111111111111111101111111110
00101111111111111111111110111101001101100011000100
loss: 0.251421, lagrangian_loss: -0.000140, attention_score_distillation_loss: 0.000883
loss: 0.280520, lagrangian_loss: 0.001688, attention_score_distillation_loss: 0.000867
----------------------------------------------------------------------
time: 2023-07-19 15:53:21
Evaluating: accuracy: 0.9063, eval_loss: 0.4611, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0843, expected_sparsity: 0.0754, expected_sequence_sparsity: 0.7837, target_sparsity: 0.0857, step: 15000
lambda_1: -2.0802, lambda_2: 88.1947 lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 0.99 0.99 0.94 0.93 0.65]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.92, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.86, 0.55]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111111101111111110
00101111111111111111111110111101001101100010000000
loss: 0.314539, lagrangian_loss: 0.002501, attention_score_distillation_loss: 0.000848
loss: 0.739795, lagrangian_loss: -0.000702, attention_score_distillation_loss: 0.000843
----------------------------------------------------------------------
time: 2023-07-19 16:07:40
Evaluating: accuracy: 0.9039, eval_loss: 0.4471, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0843, expected_sparsity: 0.0808, expected_sequence_sparsity: 0.785, target_sparsity: 0.1029, step: 18000
lambda_1: -2.2684, lambda_2: 106.6377 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 0.98 0.99 0.94 0.91 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.9, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.85, 0.52]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111110101111111110
00101111111111111111111110111101000101100010000000
loss: 0.108628, lagrangian_loss: 0.007192, attention_score_distillation_loss: 0.000823
loss: 0.436417, lagrangian_loss: -0.000340, attention_score_distillation_loss: 0.000821
----------------------------------------------------------------------
time: 2023-07-19 16:22:04
Evaluating: accuracy: 0.9046, eval_loss: 0.4555, token_prune_loc: [False, False, False, False, True, False, True, True, True], macs_sparsity: 0.1243, expected_sparsity: 0.1137, expected_sequence_sparsity: 0.7927, target_sparsity: 0.12, step: 21000
lambda_1: -1.4880, lambda_2: 123.1607 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 0.95 0.99 0.94 0.88 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.94, 0.88, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.86, 0.76, 0.47]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111011100
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111111101011111100
00101111111111111111111110111100000101100010000001
loss: 0.019959, lagrangian_loss: -0.001502, attention_score_distillation_loss: 0.000807
loss: 0.117746, lagrangian_loss: -0.000817, attention_score_distillation_loss: 0.000782
ETA: 1 day, 9:51:18 | Epoch 1 finished. Took 3310.24 seconds.
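Background on the penalty: lambda_1 and lambda_2 are the multipliers of the Lagrangian constraint that pushes expected_sparsity toward target_sparsity; the small positive and negative lagrangian_loss values in these blocks come from the gap between the two. A sketch of the usual CoFi-style form of this term (assumed here, not extracted from the code):

def lagrangian_term(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    # First- plus second-order penalty on the sparsity gap; the multipliers are
    # trained adversarially (gradient ascent), which is why they drift in the log.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2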
----------------------------------------------------------------------
time: 2023-07-19 16:36:26
Evaluating: accuracy: 0.9047, eval_loss: 0.4748, token_prune_loc: [False, False, False, False, True, False, True, True, True], macs_sparsity: 0.1345, expected_sparsity: 0.1298, expected_sequence_sparsity: 0.7965, target_sparsity: 0.1372, step: 24000
lambda_1: -1.3561, lambda_2: 136.4560 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.98 0.93 0.98 0.93 0.87 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.92, 0.86, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.83, 0.71, 0.43]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111010100
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111110111110
11111111111111111111101110111111111110101011111100
00101111111111111111111110111100000101100010000000
loss: 0.153738, lagrangian_loss: 0.000808, attention_score_distillation_loss: 0.000781
loss: 0.031842, lagrangian_loss: -0.001276, attention_score_distillation_loss: 0.000768
----------------------------------------------------------------------
time: 2023-07-19 16:50:49
Evaluating: accuracy: 0.902, eval_loss: 0.5116, token_prune_loc: [False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1558, expected_sparsity: 0.1487, expected_sequence_sparsity: 0.801, target_sparsity: 0.1543, step: 27000
lambda_1: -0.0448, lambda_2: 151.1529 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.98 0.91 0.96 0.92 0.86 0.59]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.94, 0.9, 0.86, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.76, 0.65, 0.39]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111010100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110101110
11111111111111111111101110111111111110101011111100
00101111111111111111111110111100000100101010000000
loss: 0.524790, lagrangian_loss: 0.000024, attention_score_distillation_loss: 0.000743
loss: 0.263436, lagrangian_loss: 0.000239, attention_score_distillation_loss: 0.000728
----------------------------------------------------------------------
time: 2023-07-19 17:05:15
Evaluating: accuracy: 0.9038, eval_loss: 0.4867, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1957, expected_sparsity: 0.1855, expected_sequence_sparsity: 0.8096, target_sparsity: 0.1715, step: 30000
lambda_1: -1.4335, lambda_2: 172.2853 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.96 0.9 0.96 0.89 0.85 0.58]
infer remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.94, 0.88, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.83, 0.78, 0.68, 0.57, 0.33]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110011110
11111111111111111111111110111111111111111111000100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110101100
11111111111111111111101110111111111110101011110100
00101111111111111111111110111100000100100010000000
loss: 0.208901, lagrangian_loss: -0.001067, attention_score_distillation_loss: 0.000725
loss: 0.145273, lagrangian_loss: 0.001229, attention_score_distillation_loss: 0.000702
----------------------------------------------------------------------
time: 2023-07-19 17:19:40
Evaluating: accuracy: 0.9055, eval_loss: 0.4734, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.217, expected_sparsity: 0.2022, expected_sequence_sparsity: 0.8135, target_sparsity: 0.1886, step: 33000
lambda_1: -0.5405, lambda_2: 187.6610 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.95 0.88 0.96 0.86 0.86 0.58]
infer remain: [1.0, 1.0, 1.0, 0.92, 0.86, 0.94, 0.86, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.79, 0.74, 0.64, 0.54, 0.31]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001110
11111111111111111111111110111111111111111011000100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110100100
11111111111111111111101110111111111110101011110100
10101111111111111111111110110100000100100010000000
loss: 0.636113, lagrangian_loss: -0.000260, attention_score_distillation_loss: 0.000695
ETA: 1 day, 9:21:26 | Epoch 2 finished. Took 3322.07 seconds.
loss: 0.011457, lagrangian_loss: 0.000860, attention_score_distillation_loss: 0.000686
----------------------------------------------------------------------
time: 2023-07-19 17:34:04
Evaluating: accuracy: 0.905, eval_loss: 0.4898, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2226, expected_sparsity: 0.2113, expected_sequence_sparsity: 0.8157, target_sparsity: 0.2058, step: 36000
lambda_1: -0.8275, lambda_2: 209.8195 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.93 0.86 0.96 0.85 0.85 0.58]
infer remain: [1.0, 1.0, 1.0, 0.92, 0.84, 0.94, 0.84, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.77, 0.73, 0.61, 0.51, 0.3]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001110
11111111111111111111111110111111111111111011000000
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100010100000
loss: 0.038668, lagrangian_loss: 0.000550, attention_score_distillation_loss: 0.000674
loss: 0.019112, lagrangian_loss: -0.000130, attention_score_distillation_loss: 0.000661
----------------------------------------------------------------------
time: 2023-07-19 17:48:28
Evaluating: accuracy: 0.9035, eval_loss: 0.5023, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2365, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8196, target_sparsity: 0.2229, step: 39000
lambda_1: -1.4702, lambda_2: 219.3213 lambda_3: 0.0000
train remain: [0.98 1. 1. 0.91 0.84 0.94 0.86 0.85 0.57]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.82, 0.92, 0.84, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.74, 0.68, 0.57, 0.48, 0.28]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001100
11111111111111111111111110111111111111111010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100001100100010000000
loss: 0.051066, lagrangian_loss: -0.000569, attention_score_distillation_loss: 0.000648
loss: 0.162792, lagrangian_loss: -0.000987, attention_score_distillation_loss: 0.000631
----------------------------------------------------------------------
time: 2023-07-19 18:02:45
Evaluating: accuracy: 0.9057, eval_loss: 0.4993, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2417, expected_sequence_sparsity: 0.8228, target_sparsity: 0.2401, step: 42000
lambda_1: -2.3362, lambda_2: 238.0361 lambda_3: 0.0000
train remain: [0.98 1. 0.99 0.9 0.82 0.93 0.85 0.85 0.56]
infer remain: [1.0, 1.0, 1.0, 0.88, 0.8, 0.92, 0.84, 0.84, 0.56]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.7, 0.65, 0.54, 0.46, 0.26]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111111110010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000001
loss: 0.134315, lagrangian_loss: -0.002282, attention_score_distillation_loss: 0.000621
loss: 0.673241, lagrangian_loss: 0.001305, attention_score_distillation_loss: 0.000607
----------------------------------------------------------------------
time: 2023-07-19 18:17:01
Evaluating: accuracy: 0.9045, eval_loss: 0.4668, token_prune_loc: [True, False, False, True, True, True, True, True, True], macs_sparsity: 0.2876, expected_sparsity: 0.2697, expected_sequence_sparsity: 0.8294, target_sparsity: 0.2572, step: 45000
lambda_1: -1.0491, lambda_2: 256.7698 lambda_3: 0.0000
train remain: [0.97 1. 0.99 0.89 0.79 0.93 0.85 0.85 0.55]
infer remain: [0.96, 1.0, 1.0, 0.88, 0.78, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.96, 0.84, 0.66, 0.61, 0.51, 0.43, 0.23]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111111010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000000
loss: 0.053032, lagrangian_loss: 0.000619, attention_score_distillation_loss: 0.000598
ETA: 1 day, 8:35:35 | Epoch 3 finished. Took 3300.53 seconds.
loss: 0.674372, lagrangian_loss: 0.000077, attention_score_distillation_loss: 0.000574
----------------------------------------------------------------------
time: 2023-07-19 18:31:17
Evaluating: accuracy: 0.9051, eval_loss: 0.491, token_prune_loc: [True, False, False, True, True, True, True, True, True], macs_sparsity: 0.2903, expected_sparsity: 0.275, expected_sequence_sparsity: 0.8306, target_sparsity: 0.2744, step: 48000
lambda_1: -1.7333, lambda_2: 272.6484 lambda_3: 0.0000
train remain: [0.97 1. 0.98 0.88 0.77 0.93 0.84 0.85 0.54]
infer remain: [0.96, 1.0, 1.0, 0.88, 0.76, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.96, 0.84, 0.64, 0.59, 0.5, 0.42, 0.23]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111011010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000000
loss: 0.041096, lagrangian_loss: 0.001828, attention_score_distillation_loss: 0.000557
loss: 0.590846, lagrangian_loss: -0.001445, attention_score_distillation_loss: 0.000560
----------------------------------------------------------------------
time: 2023-07-19 18:45:34
Evaluating: accuracy: 0.9051, eval_loss: 0.4827, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3126, expected_sparsity: 0.3016, expected_sequence_sparsity: 0.8369, target_sparsity: 0.2915, step: 51000
lambda_1: -1.8712, lambda_2: 290.0020 lambda_3: 0.0000
train remain: [0.97 0.99 0.95 0.86 0.77 0.92 0.84 0.84 0.54]
infer remain: [0.96, 1.0, 0.94, 0.86, 0.76, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.9, 0.78, 0.59, 0.54, 0.46, 0.38, 0.21]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110111100
11111111111111111111111111111111111111110110000100
11111111111111111111111110111111111011010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100001100100000000000
loss: 0.387114, lagrangian_loss: -0.001067, attention_score_distillation_loss: 0.000547
loss: 0.360754, lagrangian_loss: 0.003099, attention_score_distillation_loss: 0.000520
----------------------------------------------------------------------
time: 2023-07-19 18:59:58
Evaluating: accuracy: 0.901, eval_loss: 0.4962, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3395, expected_sparsity: 0.3218, expected_sequence_sparsity: 0.8417, target_sparsity: 0.3087, step: 54000
lambda_1: -0.3097, lambda_2: 309.7510 lambda_3: 0.0000
train remain: [0.97 0.99 0.94 0.85 0.75 0.91 0.83 0.84 0.53]
infer remain: [0.96, 1.0, 0.92, 0.84, 0.74, 0.92, 0.82, 0.84, 0.52]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.74, 0.55, 0.51, 0.41, 0.35, 0.18]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110101100
11111111111111111111111111111111111111110110000000
11111111111111111111111110111111111010010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100000
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100000100100000000000
loss: 0.595147, lagrangian_loss: 0.000261, attention_score_distillation_loss: 0.000511
loss: 0.314396, lagrangian_loss: -0.000247, attention_score_distillation_loss: 0.000506
ETA: 1 day, 7:22:20 | Epoch 4 finished. Took 3097.1 seconds.
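Note on "layerwise remain": it is the running product of "infer remain" across the pruned layers, since a token bin dropped at layer l never reaches layer l+1; layers 0-2 sit outside prune_location, hence the leading 1.0 entries. Reproducing the step-54000 block above:

from itertools import accumulate
from operator import mul

infer_remain = [0.96, 1.0, 0.92, 0.84, 0.74, 0.92, 0.82, 0.84, 0.52]  # layers 3..11
layerwise = [1.0, 1.0, 1.0] + list(accumulate(infer_remain, mul))
print([round(x, 2) for x in layerwise])
# [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.74, 0.55, 0.51, 0.41, 0.35, 0.18] -- as logged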
----------------------------------------------------------------------
time: 2023-07-19 19:14:26
Evaluating: accuracy: 0.898, eval_loss: 0.5246, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3524, expected_sparsity: 0.3308, expected_sequence_sparsity: 0.8438, target_sparsity: 0.3258, step: 57000
lambda_1: -3.3939, lambda_2: 326.0374 lambda_3: 0.0000
train remain: [0.97 0.99 0.92 0.82 0.75 0.9 0.83 0.84 0.51]
infer remain: [0.96, 1.0, 0.92, 0.82, 0.74, 0.9, 0.82, 0.84, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.72, 0.54, 0.48, 0.4, 0.33, 0.17]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110101100
11111111111111111111111111111111111111100110000000
11111111111111111111111110111111111010010010000000
11111111111111111111111110111111111111111111100100
10111111111111111111111110111111111011111110100000
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100000000100000000000
loss: 0.010384, lagrangian_loss: -0.000795, attention_score_distillation_loss: 0.000491
loss: 0.819248, lagrangian_loss: 0.000366, attention_score_distillation_loss: 0.000479
----------------------------------------------------------------------
time: 2023-07-19 19:28:55
Evaluating: accuracy: 0.8998, eval_loss: 0.5425, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3635, expected_sparsity: 0.3468, expected_sequence_sparsity: 0.8475, target_sparsity: 0.343, step: 60000
lambda_1: -0.7647, lambda_2: 344.1528 lambda_3: 0.0000
train remain: [0.97 0.99 0.91 0.81 0.75 0.87 0.83 0.81 0.49]
infer remain: [0.96, 1.0, 0.9, 0.8, 0.74, 0.88, 0.82, 0.82, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.86, 0.69, 0.51, 0.45, 0.37, 0.3, 0.15]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001100
11111111111111111111111111111111111111100010000000
11111111111111111111111110111111111010010010000000
10111111111111111111111110111111111111111111100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111110101011110100
00101111111111111111111010010100000000100000000000
loss: 0.371095, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.000465
loss: 0.383557, lagrangian_loss: -0.007696, attention_score_distillation_loss: 0.000452
----------------------------------------------------------------------
time: 2023-07-19 19:43:28
Evaluating: accuracy: 0.8984, eval_loss: 0.5257, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3848, expected_sparsity: 0.367, expected_sequence_sparsity: 0.8523, target_sparsity: 0.3601, step: 63000
lambda_1: -2.5047, lambda_2: 366.3772 lambda_3: 0.0000
train remain: [0.97 0.98 0.89 0.79 0.73 0.87 0.82 0.79 0.48]
infer remain: [0.96, 0.98, 0.88, 0.8, 0.72, 0.86, 0.82, 0.8, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.83, 0.66, 0.48, 0.41, 0.34, 0.27, 0.13]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111110001000
11111111111111111111111111111111111111100010000000
11111111111111111111111110111111111010010000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111110101011110000
00111111110111111111111010010100000000100000000000
loss: 0.127202, lagrangian_loss: 0.000266, attention_score_distillation_loss: 0.000439
loss: 0.121989, lagrangian_loss: 0.001641, attention_score_distillation_loss: 0.000421
----------------------------------------------------------------------
time: 2023-07-19 19:57:55
Evaluating: accuracy: 0.8995, eval_loss: 0.4993, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4042, expected_sparsity: 0.3851, expected_sequence_sparsity: 0.8565, target_sparsity: 0.3773, step: 66000
lambda_1: -2.7617, lambda_2: 379.9322 lambda_3: 0.0000
train remain: [0.97 0.98 0.85 0.77 0.72 0.86 0.82 0.76 0.48]
infer remain: [0.96, 0.98, 0.84, 0.78, 0.72, 0.86, 0.82, 0.76, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.62, 0.44, 0.38, 0.31, 0.24, 0.11]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011110001000
11111111111111111111111111111111111111100000000000
11111111111111111111111110111111111010010000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111010001011110000
00111111110111111111111010010100000000100000000000
loss: 0.031277, lagrangian_loss: -0.003024, attention_score_distillation_loss: 0.000413
loss: 0.289318, lagrangian_loss: -0.005275, attention_score_distillation_loss: 0.000401
ETA: 1 day, 6:39:01 | Epoch 5 finished. Took 3337.65 seconds.
----------------------------------------------------------------------
time: 2023-07-19 20:12:22
Evaluating: accuracy: 0.8951, eval_loss: 0.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4144, expected_sparsity: 0.398, expected_sequence_sparsity: 0.8596, target_sparsity: 0.3944, step: 69000
lambda_1: -2.4025, lambda_2: 396.2157 lambda_3: 0.0000
train remain: [0.97 0.98 0.84 0.75 0.69 0.86 0.82 0.73 0.48]
infer remain: [0.96, 0.98, 0.84, 0.76, 0.68, 0.86, 0.82, 0.72, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.6, 0.41, 0.35, 0.29, 0.21, 0.1]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011110001000
11111111111111111111111111111111111111000000000000
10111111111111111111111110111111111010000000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001011110000
00101111110111111111111010010100000000110000000000
loss: 0.346848, lagrangian_loss: 0.008622, attention_score_distillation_loss: 0.000380
loss: 0.120472, lagrangian_loss: 0.007903, attention_score_distillation_loss: 0.000367
----------------------------------------------------------------------
time: 2023-07-19 20:26:48
Evaluating: accuracy: 0.9013, eval_loss: 0.4974, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4255, expected_sparsity: 0.4128, expected_sequence_sparsity: 0.8631, target_sparsity: 0.4116, step: 72000
lambda_1: -6.2552, lambda_2: 414.8770 lambda_3: 0.0000
train remain: [0.96 0.98 0.82 0.73 0.67 0.84 0.82 0.7 0.44]
infer remain: [0.96, 0.98, 0.82, 0.74, 0.68, 0.82, 0.82, 0.7, 0.44]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.77, 0.57, 0.39, 0.32, 0.26, 0.18, 0.08]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011100001000
11111111111111111111111110111111111111000000000000
10111111111111111111111110111111111010000000000000
10111111111111111111111110111111111111110110100000
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001010110000
00001111110111111111011010010100000000100000000001
loss: 0.267200, lagrangian_loss: 0.009891, attention_score_distillation_loss: 0.000353
loss: 0.012345, lagrangian_loss: -0.000463, attention_score_distillation_loss: 0.000349
----------------------------------------------------------------------
time: 2023-07-19 20:41:21
Evaluating: accuracy: 0.8968, eval_loss: 0.5501, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4551, expected_sparsity: 0.4367, expected_sequence_sparsity: 0.8687, target_sparsity: 0.4287, step: 75000
lambda_1: -4.8097, lambda_2: 432.0788 lambda_3: 0.0000
train remain: [0.96 0.97 0.79 0.73 0.67 0.83 0.82 0.67 0.39]
infer remain: [0.96, 0.96, 0.78, 0.72, 0.66, 0.82, 0.82, 0.68, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.92, 0.72, 0.52, 0.34, 0.28, 0.23, 0.16, 0.06]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111110
11111111111111111111111110111111111111011000000000
11111111111111111111111110111111111110000000000000
10111111111111111111111110111111101010000000000000
10111111111111111111111110111111111111110110100000
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001000110000
00001111110111101111011010010100000000100000000000
loss: 0.197921, lagrangian_loss: 0.009890, attention_score_distillation_loss: 0.000330
loss: 0.920879, lagrangian_loss: -0.004288, attention_score_distillation_loss: 0.000323
----------------------------------------------------------------------
time: 2023-07-19 20:55:51
Evaluating: accuracy: 0.8948, eval_loss: 0.4981, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4661, expected_sparsity: 0.4508, expected_sequence_sparsity: 0.872, target_sparsity: 0.4459, step: 78000
lambda_1: -3.2839, lambda_2: 450.0362 lambda_3: 0.0000
train remain: [0.96 0.97 0.76 0.71 0.64 0.81 0.82 0.64 0.36]
infer remain: [0.96, 0.96, 0.76, 0.7, 0.64, 0.8, 0.82, 0.64, 0.36]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.92, 0.7, 0.49, 0.31, 0.25, 0.21, 0.13, 0.05]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111110
11111111111111111111111110111111111011011000000000
10111111111111111111111110111111111110000000000000
10111111111111111111111110111111101000000000000000
10111111111111111111111110111111111011110110100000
10111111111111111111111110111111111011111110100000
00101111111111111111101110111111011000001000110000
00001111110110101011011010010000000000100001000000
loss: 0.069720, lagrangian_loss: -0.005718, attention_score_distillation_loss: 0.000312
loss: 0.394201, lagrangian_loss: 0.000193, attention_score_distillation_loss: 0.000298
ETA: 1 day, 5:52:26 | Epoch 6 finished. Took 3340.85 seconds.
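Note on the 0/1 dumps: each 50-character row in these blocks is one pruned layer's keep/drop decision per position bin (bin_num=50), and "infer remain" is simply the fraction of 1s in the corresponding row. A throwaway helper to read them:

def remain_fraction(mask_row: str) -> float:
    """Fraction of the position bins kept by one layer's binary mask row."""
    return mask_row.count("1") / len(mask_row)

# Layer-3 row of the step-78000 block above: 48 of 50 bins kept.
print(remain_fraction("11111111111111111111111111111111111111111111111100"))  # 0.96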
----------------------------------------------------------------------
time: 2023-07-19 21:10:18
Evaluating: accuracy: 0.8954, eval_loss: 0.5641, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4874, expected_sparsity: 0.4685, expected_sequence_sparsity: 0.8761, target_sparsity: 0.463, step: 81000
lambda_1: -1.6612, lambda_2: 467.1519 lambda_3: 0.0000
train remain: [0.97 0.95 0.74 0.69 0.63 0.79 0.74 0.59 0.35]
infer remain: [0.96, 0.94, 0.74, 0.7, 0.62, 0.78, 0.74, 0.58, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.9, 0.67, 0.47, 0.29, 0.23, 0.17, 0.1, 0.03]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101011011000000000
10111111111111111111111110111111111110000000000000
10111111111111111111111110111111001000000000000000
10011111111111111111111110111111111011110110100000
10011101110111111011111110111111111011111110100000
00101111110111111011101010111111011000001000110000
00001111010110101011011010010000000001100000000000
loss: 0.779751, lagrangian_loss: -0.000404, attention_score_distillation_loss: 0.000285
loss: 0.391330, lagrangian_loss: 0.009232, attention_score_distillation_loss: 0.000269
----------------------------------------------------------------------
time: 2023-07-19 21:24:47
Evaluating: accuracy: 0.8982, eval_loss: 0.5389, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4985, expected_sparsity: 0.4826, expected_sequence_sparsity: 0.8794, target_sparsity: 0.4802, step: 84000
lambda_1: -4.5302, lambda_2: 485.2544 lambda_3: 0.0000
train remain: [0.96 0.94 0.72 0.66 0.61 0.78 0.71 0.49 0.35]
infer remain: [0.96, 0.94, 0.72, 0.66, 0.62, 0.76, 0.7, 0.48, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.9, 0.65, 0.43, 0.27, 0.2, 0.14, 0.07, 0.02]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101011010000000000
10111111111111111111111110111111110100000000000000
10111111111111111111111110111111001000000000000000
10011111111111111111111110111111011011110110100000
10011101110111111011111110111111011011110110100000
00001101110111111011101010011101011000000000110000
00001111010110101011011010010000000000100001000000
loss: 0.184746, lagrangian_loss: 0.000812, attention_score_distillation_loss: 0.000253
loss: 0.169796, lagrangian_loss: -0.004998, attention_score_distillation_loss: 0.000246
----------------------------------------------------------------------
time: 2023-07-19 21:39:16
Evaluating: accuracy: 0.8931, eval_loss: 0.532, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5179, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.8838, target_sparsity: 0.4973, step: 87000
lambda_1: -4.9811, lambda_2: 503.5086 lambda_3: 0.0000
train remain: [0.95 0.94 0.71 0.62 0.61 0.73 0.67 0.45 0.32]
infer remain: [0.94, 0.94, 0.7, 0.62, 0.62, 0.72, 0.66, 0.44, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.62, 0.38, 0.24, 0.17, 0.11, 0.05, 0.02]
11111111111111111111111111111111111111111111111000
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101010010000000000
10111111111111111111111110111111100000000000000000
10111111111111111111111110111110001000000000100000
10011111111111111011111110011111011011110110100000
10011101110111111011111010111101011011110110100000
00001101110111101011001010011101011000000000110000
00001111010110101011011010010000000000100000000000
loss: 0.455696, lagrangian_loss: 0.014450, attention_score_distillation_loss: 0.000231
loss: 0.308696, lagrangian_loss: 0.001392, attention_score_distillation_loss: 0.000217
----------------------------------------------------------------------
time: 2023-07-19 21:53:47
Evaluating: accuracy: 0.8944, eval_loss: 0.4786, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.529, expected_sparsity: 0.5139, expected_sequence_sparsity: 0.8868, target_sparsity: 0.5145, step: 90000
lambda_1: -4.3059, lambda_2: 521.4919 lambda_3: 0.0000
train remain: [0.93 0.93 0.67 0.62 0.6 0.69 0.63 0.41 0.31]
infer remain: [0.92, 0.94, 0.68, 0.62, 0.6, 0.7, 0.62, 0.42, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 0.92, 0.86, 0.59, 0.36, 0.22, 0.15, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111011111000
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101010000000000000
10111111111111111111111110111110101000000000000000
10111111111111111111111110111110001000000000000000
10011111111111111011111110011101011011110110100000
10001101110111111011111010011101011011110110100000
00001100110111101011001010011100011000000100110000
00000111010110101011011010000000000100100001000000
loss: 0.593622, lagrangian_loss: 0.001316, attention_score_distillation_loss: 0.000204
ETA: 1 day, 5:03:27 | Epoch 7 finished. Took 3338.89 seconds.
loss: 0.123990, lagrangian_loss: 0.002299, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-19 22:08:20
Evaluating: accuracy: 0.8959, eval_loss: 0.5149, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.554, expected_sparsity: 0.5383, expected_sequence_sparsity: 0.8925, target_sparsity: 0.5316, step: 93000
lambda_1: -7.9677, lambda_2: 538.8955 lambda_3: 0.0000
train remain: [0.89 0.91 0.67 0.62 0.6 0.67 0.6 0.37 0.3 ]
infer remain: [0.88, 0.9, 0.66, 0.62, 0.6, 0.66, 0.6, 0.38, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 0.88, 0.79, 0.52, 0.32, 0.19, 0.13, 0.08, 0.03, 0.01]
11111111111111111111111110111111111111111011110000
11111111111111111111111110111111111111111101110100
11111111111111111111111110111111101000000000000000
10111111111111111111111110111110100000000000000100
10111111111111111111111110111100001000000100000000
10001111110111111011111110011101011011110110100000
10001101110111111011011010011101011011110110100000
00001100110110101011001010011100011000000100100000
10000111010110101011011010000000000000100000000000
loss: 0.035226, lagrangian_loss: -0.010783, attention_score_distillation_loss: 0.000178
loss: 0.611543, lagrangian_loss: 0.013423, attention_score_distillation_loss: 0.000163
----------------------------------------------------------------------
time: 2023-07-19 22:22:52
Evaluating: accuracy: 0.8924, eval_loss: 0.4819, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5623, expected_sparsity: 0.5483, expected_sequence_sparsity: 0.8949, target_sparsity: 0.5488, step: 96000
lambda_1: -4.6683, lambda_2: 556.6636 lambda_3: 0.0000
train remain: [0.86 0.9 0.64 0.61 0.6 0.61 0.58 0.35 0.3 ]
infer remain: [0.86, 0.9, 0.64, 0.62, 0.6, 0.62, 0.58, 0.36, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.77, 0.5, 0.31, 0.18, 0.11, 0.07, 0.02, 0.01]
11111111111111111111111110111111111111111011010000
11111111111111111111111110111111111111111101110100
11111111111111111111111110110111101000000000000000
10111111111111111111111110111110100000000000100000
10111111111111111111111110111100001001000000000000
10001111110111101011111010011101011011110110100000
10001101110111101011011010011101011011110110100000
10001100110110101011001010010100011000000000100000
10000111010010101011011010000001000000100000000000
loss: 0.231981, lagrangian_loss: 0.007939, attention_score_distillation_loss: 0.000153
loss: 0.162509, lagrangian_loss: -0.010745, attention_score_distillation_loss: 0.000140
----------------------------------------------------------------------
time: 2023-07-19 22:37:21
Evaluating: accuracy: 0.8893, eval_loss: 0.5444, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.579, expected_sparsity: 0.5658, expected_sequence_sparsity: 0.899, target_sparsity: 0.5659, step: 99000
lambda_1: -5.4141, lambda_2: 573.7689 lambda_3: 0.0000
train remain: [0.85 0.86 0.63 0.6 0.54 0.6 0.56 0.35 0.27]
infer remain: [0.84, 0.86, 0.64, 0.6, 0.54, 0.6, 0.56, 0.36, 0.26]
layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.72, 0.46, 0.28, 0.15, 0.09, 0.05, 0.02, 0.0]
11111111111111111111111110111111111111110011010000
10111111111111111111011110111111111111111101110100
11111111111111111111111110110111100010000000000000
10111111111111111111111110111110100000000000000000
10011111111111111011111110111100001000000000000000
10001111110111101011011010011101011011110110100000
10001101110110101011011010011101011001110111100000
00001100110110101001001010010100011000010100100000
10000111010010101011011010000000000000000000000000
loss: 0.315823, lagrangian_loss: 0.000172, attention_score_distillation_loss: 0.000128
loss: 0.141669, lagrangian_loss: -0.009282, attention_score_distillation_loss: 0.000114
----------------------------------------------------------------------
time: 2023-07-19 22:51:51
Evaluating: accuracy: 0.8854, eval_loss: 0.5078, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5929, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.9027, target_sparsity: 0.5831, step: 102000
lambda_1: -7.4074, lambda_2: 590.8319 lambda_3: 0.0000
train remain: [0.82 0.83 0.6 0.6 0.5 0.58 0.56 0.34 0.23]
infer remain: [0.82, 0.84, 0.6, 0.6, 0.5, 0.58, 0.56, 0.34, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.69, 0.41, 0.25, 0.12, 0.07, 0.04, 0.01, 0.0]
11111111111111111111111110111111111111110001010000
10111111111111111111011110111111111110111101110100
10111111111111111111111110110111100000000000000000
10111111111111111111111110111110001000000000000000
10011111111111111011111010011100001000000000000000
10001011110111101011011010011101011011110110100000
10001101110110101011011010011101011001110111100000
00001100110110101001001010010100011001000000100000
10000011010010101011010010000000000001000000000000
loss: 0.289435, lagrangian_loss: 0.002069, attention_score_distillation_loss: 0.000101
ETA: 1 day, 4:13:19 | Epoch 8 finished. Took 3345.01 seconds.
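Aside on the ETA lines: they are consistent with multiplying the mean epoch duration so far by the number of epochs remaining (num_train_epochs=40). A quick check against the nine epoch times logged up to this point:

from datetime import timedelta

epoch_secs = [3104.41, 3310.24, 3322.07, 3300.53, 3097.1,
              3337.65, 3340.85, 3338.89, 3345.01]   # epochs 0..8 above
remaining = 40 - len(epoch_secs)
print(timedelta(seconds=round(sum(epoch_secs) / len(epoch_secs) * remaining)))
# 1 day, 4:13:20 -- matching "ETA: 1 day, 4:13:19" up to rounding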
loss: 0.402243, lagrangian_loss: -0.007613, attention_score_distillation_loss: 0.000089
----------------------------------------------------------------------
time: 2023-07-19 23:06:16
Evaluating: accuracy: 0.8868, eval_loss: 0.5376, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.615, expected_sparsity: 0.6017, expected_sequence_sparsity: 0.9074, target_sparsity: 0.6002, step: 105000
lambda_1: -8.2756, lambda_2: 607.7910 lambda_3: 0.0000
train remain: [0.81 0.78 0.56 0.6 0.46 0.54 0.54 0.3 0.22]
infer remain: [0.8, 0.78, 0.56, 0.6, 0.46, 0.54, 0.54, 0.3, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.62, 0.35, 0.21, 0.1, 0.05, 0.03, 0.01, 0.0]
10111111111111111111111110111111111111110001010000
10111111110111111111011110111111111110010101110100
10111111111111111111111110110110000000000000000000
10111111111111111111111110111111000000000000000000
10011111110111101011011010011100001001000000000000
10000011110110101011011010011101011011110110100000
10001101110110101011011010011101010001110110100001
00000100110110101001001010010100011000000000100000
00000011010010101011010010000000010000000000000000
loss: 0.066504, lagrangian_loss: 0.006595, attention_score_distillation_loss: 0.000074
loss: 0.568262, lagrangian_loss: -0.009137, attention_score_distillation_loss: 0.000063
----------------------------------------------------------------------
time: 2023-07-19 23:20:42
Evaluating: accuracy: 0.8866, eval_loss: 0.5264, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6289, expected_sparsity: 0.6162, expected_sequence_sparsity: 0.9109, target_sparsity: 0.6173, step: 108000
lambda_1: -7.4469, lambda_2: 626.0708 lambda_3: 0.0000
train remain: [0.78 0.74 0.53 0.58 0.44 0.48 0.45 0.24 0.18]
infer remain: [0.78, 0.74, 0.54, 0.58, 0.44, 0.48, 0.44, 0.24, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.58, 0.31, 0.18, 0.08, 0.04, 0.02, 0.0, 0.0]
10111111111111111111111110111111111111110001000000
10111111110111111111011110111101111110010100110100
10111111111111111111111110110100000000000000000000
10111111111111111111111110111110000000000000000000
10011111110111101011011010011100001000000000000000
10000011110110101011011010011101010001010110100000
10000001110110101011010010011101010001010110000001
10000000110010101001001010000100010001000000000000
10000010010010001001000010000000010000000000000000
loss: 0.894482, lagrangian_loss: 0.011798, attention_score_distillation_loss: 0.000049
loss: 0.571284, lagrangian_loss: 0.014897, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 23:35:04
Evaluating: accuracy: 0.8851, eval_loss: 0.5031, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6483, expected_sparsity: 0.6345, expected_sequence_sparsity: 0.9152, target_sparsity: 0.6345, step: 111000
lambda_1: -11.8927, lambda_2: 643.7094 lambda_3: 0.0000
train remain: [0.75 0.71 0.51 0.5 0.41 0.44 0.41 0.22 0.3 ]
infer remain: [0.74, 0.72, 0.5, 0.5, 0.4, 0.44, 0.4, 0.22, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.53, 0.27, 0.13, 0.05, 0.02, 0.01, 0.0, 0.0]
00111111111111111111111110111111111111110000000000
10111111110011111111011110111101111110010100110100
10111111111111111111101110100100000000000000000000
10011111111111101011111110111100000000000000000000
10001111110110101011011010011100001000000000000000
10000011110010101011011010011101010001010110000000
10000001110010101011010010001101010001010110000001
10000000110010101001000010000100010001000000000000
10000010010010001001000010000000010001000000000000
loss: 0.458527, lagrangian_loss: -0.009432, attention_score_distillation_loss: 0.000024
loss: 0.104331, lagrangian_loss: -0.014244, attention_score_distillation_loss: 0.000011
ETA: 1 day, 3:10:17 | Epoch 9 finished. Took 3108.96 seconds.
Starting saving the best from epoch 10 and step 114000
----------------------------------------------------------------------
time: 2023-07-19 23:49:26
Evaluating: accuracy: 0.8817, eval_loss: 0.5588, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6502, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 114000
lambda_1: -9.3703, lambda_2: 665.1255 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.39 0.37 0.37 0.23 0.45]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.36, 0.36, 0.22, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10101111110011111111010110111101111010010100110100
10111111111111111011101110100000000000000000000000
10001111110111101011111110011100000000000000000000
10000111110110101011011010011100000000000100000000
10000001110010101001010010001101010001010110000000
10000001110010101001010010001100010001010110000001
10000000110010101001000010000100010001000000000000
10000000010010001001000010000000010001000000000000
Saving the best model so far: [Epoch 10 | Step: 114000 | MACs sparsity: 0.6622 | Score: 0.8817 | Loss: 0.5588]
loss: 0.636109, lagrangian_loss: -0.008807, attention_score_distillation_loss: 0.000010
loss: 0.246043, lagrangian_loss: 0.000224, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:04:02
Evaluating: accuracy: 0.8825, eval_loss: 0.5941, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.65, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 117000
lambda_1: -2.0260, lambda_2: 680.1785 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.38 0.35 0.37 0.22 0.76]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.36, 0.36, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10101111110011111111010110111101111010110100100100
10111111111111111011100110100000000000000000001000
10001111110111101011011110011100000001000000000000
10000111110110101011011010011100000000000100000000
10000011110010101001010010001101010001010100000000
10000001110010101001010010001100010001010110000001
10000000110010101001000010000100010001000000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8817 @ step 114000 epoch 10.03
Saving the best model so far: [Epoch 10 | Step: 117000 | MACs sparsity: 0.6622 | Score: 0.8825 | Loss: 0.5941]
loss: 0.222651, lagrangian_loss: 0.012709, attention_score_distillation_loss: 0.000010
loss: 0.130893, lagrangian_loss: 0.000657, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:19:01
Evaluating: accuracy: 0.8886, eval_loss: 0.5467, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 120000
lambda_1: -3.4723, lambda_2: 696.2979 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.44 0.37 0.33 0.31 0.22 0.84]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.32, 0.3, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110100000000000000000000000
10001111110111101011011110011100000001000000000000
10000111110010101011011010001100000000010010000000
10000001110010101001010010000100010001010100000001
10000000110010101001010010000100010001010100000001
10000000110010001001000010000100010001010000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8825 @ step 117000 epoch 10.29
Saving the best model so far: [Epoch 10 | Step: 120000 | MACs sparsity: 0.6622 | Score: 0.8886 | Loss: 0.5467]
loss: 0.149866, lagrangian_loss: -0.003486, attention_score_distillation_loss: 0.000010
loss: 0.057800, lagrangian_loss: -0.000276, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:33:44
Evaluating: accuracy: 0.8865, eval_loss: 0.4985, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6505, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 123000
lambda_1: -1.6749, lambda_2: 713.5250 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.36 0.31 0.29 0.23 0.93]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.32, 0.3, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
11101111110011111111010110111101111010010100100100
10111111111111111011100110100100000000000000000000
10001111110111101011011010011101000001000000000000
10000111110110101011011010001100000000010000000000
10000001110010101001010010000100010001010100000001
10000000110010101001000010000100010001010100000011
10000000110010001001000010000100010001010000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
loss: 0.205889, lagrangian_loss: 0.005737, attention_score_distillation_loss: 0.000010
loss: 0.477401, lagrangian_loss: 0.000599, attention_score_distillation_loss: 0.000010
ETA: 1 day, 2:21:12 | Epoch 10 finished. Took 3380.22 seconds.
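Note on checkpointing: selection of the best model only begins once the warmup reaches the full 0.65 target ("Starting saving the best from epoch 10 and step 114000" above); after that, any evaluation that beats the best score so far is written out, otherwise only the current best is reported. A sketch of that bookkeeping (names are illustrative, not the repo's API):

class BestModelTracker:
    def __init__(self, save_start_step: int = 114000):
        self.save_start_step = save_start_step
        self.best_score, self.best_step = None, None

    def update(self, step: int, score: float) -> bool:
        """Return True when the checkpoint at `step` should be saved."""
        if step < self.save_start_step:
            return False
        if self.best_score is None or score > self.best_score:
            self.best_score, self.best_step = score, step
            return True
        return False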
----------------------------------------------------------------------
time: 2023-07-20 00:48:01
Evaluating: accuracy: 0.8843, eval_loss: 0.5702, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 126000
lambda_1: -1.5006, lambda_2: 730.7234 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.34 0.3 0.28 0.24 0.95]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111110110000000000000000000000000
10001111110111101011011010011101000001000000000000
10000111110010101011011010001100000000010000000000
10000001110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
loss: 0.278874, lagrangian_loss: -0.000694, attention_score_distillation_loss: 0.000010
loss: 0.249959, lagrangian_loss: 0.008280, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:02:23
Evaluating: accuracy: 0.8899, eval_loss: 0.5189, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 129000
lambda_1: -1.2810, lambda_2: 747.8193 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.45 0.34 0.3 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110100000000000000000000000
10001111110111101011011010011100000101000000000000
10000111110010101011010010001100000000010010000000
10000001110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
Saving the best model so far: [Epoch 11 | Step: 129000 | MACs sparsity: 0.6649 | Score: 0.8899 | Loss: 0.5189]
loss: 0.531214, lagrangian_loss: -0.000237, attention_score_distillation_loss: 0.000010
loss: 0.436766, lagrangian_loss: 0.002168, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:16:59
Evaluating: accuracy: 0.8302, eval_loss: 0.7375, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 132000
lambda_1: -1.7529, lambda_2: 764.9451 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.45 0.34 0.29 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011101000001000000000000
10000111110010101011010010001100000000010100000000 10000000110010101001000010001100010001010110000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010000000001 11111111111111111111111111111111111111111111111111 Best eval score so far: 0.8899 @ step 129000 epoch 11.34 loss: 0.528980, lagrangian_loss: -0.000636, attention_score_distillation_loss: 0.000010 loss: 0.455666, lagrangian_loss: 0.015322, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 01:31:19 Evaluating: accuracy: 0.8904, eval_loss: 0.495, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6506, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 135000 lambda_1: -1.3714, lambda_2: 782.2104 lambda_3: 0.0000 train remain: [0.72 0.66 0.47 0.46 0.36 0.29 0.28 0.25 0.96] infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.3, 0.28, 0.24, 1.0] layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0] 00111111111111111111111110111111111110100000010000 10101111110011111111010110111101111010010110100100 10111111111111111111100110010000000000000000000000 10001111110111101011011010011100010001000000000000 10000111110010101011010010001100010001010000000000 10000000110010101001000010000101010001010110000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000000 11111111111111111111111111111111111111111111111111 Best eval score so far: 0.8899 @ step 129000 epoch 11.34 Saving the best model so far: [Epoch 11 | Step: 135000 | MACs sparsity: 0.6649 | Score: 0.8904 | Loss: 0.495] loss: 0.242830, lagrangian_loss: 0.000136, attention_score_distillation_loss: 0.000010 ETA: 1 day, 1:29:56 | Epoch 11 finished. Took 3355.22 seconds. 
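The tiny positive and negative lagrangian_loss values in these records match what a two-multiplier Lagrangian relaxation of the sparsity constraint would produce. A sketch assuming the CoFi-style form lambda_1 * (s - t) + lambda_2 * (s - t)^2, where s is the expected sparsity, t the target, and the multipliers are the logged lambda_1/lambda_2 (trained adversarially at reg_learning_rate 0.01):

```python
def lagrangian_loss(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    # Assumed CoFi-style penalty: linear plus quadratic term in the gap
    # between expected and target sparsity. Because the multipliers are
    # trained adversarially, the term can go slightly negative when the
    # gap and lambda_1 have opposite signs, as seen in the log.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# Plugging in the step-135000 record (expected 0.6506, target 0.65) yields a
# value of the same tiny magnitude as the logged lagrangian_loss figures:
print(lagrangian_loss(0.6506, 0.65, -1.3714, 782.2104))
```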
loss: 0.639652, lagrangian_loss: 0.003957, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:46:13
Evaluating: accuracy: 0.8949, eval_loss: 0.5042, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6507, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 138000
lambda_1: -2.2353, lambda_2: 799.8128 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.46 0.37 0.29 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011100010001000000000000
10000111110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8904 @ step 135000 epoch 11.87
Saving the best model so far: [Epoch 12 | Step: 138000 | MACs sparsity: 0.6649 | Score: 0.8949 | Loss: 0.5042]
loss: 0.164816, lagrangian_loss: -0.000475, attention_score_distillation_loss: 0.000010
loss: 0.077830, lagrangian_loss: -0.001258, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:01:08
Evaluating: accuracy: 0.8944, eval_loss: 0.5021, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.6502, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 141000
lambda_1: -1.8924, lambda_2: 816.6750 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.46 0.36 0.29 0.28 0.27 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.36, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011100010101000000000000
10000111110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.391748, lagrangian_loss: 0.001875, attention_score_distillation_loss: 0.000010
loss: 0.035931, lagrangian_loss: -0.001861, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:15:27
Evaluating: accuracy: 0.8911, eval_loss: 0.5233, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 144000
lambda_1: -2.2427, lambda_2: 833.6587 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.48 0.35 0.28 0.28 0.29 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110100000000000000000000000
10001111110111101011011010011100010101000000000000
10000111110010101011010010001100010000010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.833562, lagrangian_loss: 0.003554, attention_score_distillation_loss: 0.000010
loss: 0.029055, lagrangian_loss: -0.000370, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:29:49
Evaluating: accuracy: 0.8857, eval_loss: 0.5672, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 147000
lambda_1: -1.6238, lambda_2: 850.7535 lambda_3: 0.0000
train remain: [0.72 0.66 0.47 0.47 0.35 0.29 0.28 0.39 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110101000000000000000000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000001110010101001000010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.978059, lagrangian_loss: 0.005119, attention_score_distillation_loss: 0.000010
ETA: 1 day, 0:37:22 | Epoch 12 finished. Took 3338.46 seconds.
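For context, the 0/1 rows are driven by L0 hard-concrete gates over token bins (the run logs temperature 0.67, init drop rate 0.01, and token_loga of shape [9, 50]). A minimal sketch of hard-concrete sampling under those hyper-parameters, assuming the standard (-0.1, 1.1) stretch from Louizos et al.:

```python
import math
import torch

temperature = 2.0 / 3.0       # logged as temperature: 0.67
droprate_init = 0.01          # logged init drop rate
limit_l, limit_r = -0.1, 1.1  # assumed standard stretch interval

# token_loga shape [9, 50]: 9 pruned layers x 50 token bins; with
# droprate_init = 0.01 the log-alpha starts strongly in favor of keeping.
loga = torch.full((9, 50), math.log(1 - droprate_init) - math.log(droprate_init))

def sample_gates(loga: torch.Tensor) -> torch.Tensor:
    """Sample stretched hard-concrete gates in [0, 1] (training-time)."""
    u = torch.rand_like(loga).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((torch.log(u) - torch.log1p(-u) + loga) / temperature)
    return (s * (limit_r - limit_l) + limit_l).clamp(0.0, 1.0)

print(sample_gates(loga).mean())  # starts near 1.0: almost all bins kept
```

At inference the gates are thresholded to the hard 0/1 decisions printed in the mask rows.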
loss: 0.472538, lagrangian_loss: 0.010825, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:44:10
Evaluating: accuracy: 0.8881, eval_loss: 0.5683, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 150000
lambda_1: -1.7713, lambda_2: 866.8922 lambda_3: 0.0000
train remain: [0.72 0.66 0.47 0.47 0.36 0.29 0.28 0.43 0.95]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110000001100000000000000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.023684, lagrangian_loss: -0.000878, attention_score_distillation_loss: 0.000010
loss: 0.013496, lagrangian_loss: 0.000733, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:58:30
Evaluating: accuracy: 0.8935, eval_loss: 0.4978, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 153000
lambda_1: -1.5745, lambda_2: 883.7991 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.36 0.29 0.28 0.35 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000000010000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.093153, lagrangian_loss: 0.000674, attention_score_distillation_loss: 0.000010
loss: 0.120158, lagrangian_loss: -0.000088, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:13:04
Evaluating: accuracy: 0.8931, eval_loss: 0.5467, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 156000
lambda_1: -1.8309, lambda_2: 900.7239 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.35 0.29 0.28 0.31 0.96]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100110100
10111111111111111011100110000000000100000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.427451, lagrangian_loss: 0.002730, attention_score_distillation_loss: 0.000010
loss: 0.032086, lagrangian_loss: 0.001369, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:27:27
Evaluating: accuracy: 0.8967, eval_loss: 0.5235, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 159000
lambda_1: -1.9873, lambda_2: 918.0266 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.36 0.29 0.28 0.31 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110010000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
Saving the best model so far: [Epoch 13 | Step: 159000 | MACs sparsity: 0.6649 | Score: 0.8967 | Loss: 0.5235]
loss: 0.513739, lagrangian_loss: 0.000643, attention_score_distillation_loss: 0.000010
ETA: 23:44:43 | Epoch 13 finished. Took 3349.85 seconds.
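The "Best eval score so far" / "Saving the best model so far" pairs follow a plain keep-the-best rule: report the previous best, then save whenever the new score beats it. A hypothetical sketch of that bookkeeping (names invented, not the repo's actual code):

```python
class BestScoreTracker:
    """Sketch of the checkpoint bookkeeping behind the
    'Best eval score so far' / 'Saving the best model so far' lines."""

    def __init__(self):
        self.best = None  # (score, step, epoch) of the best eval so far

    def update(self, score, step, epoch, macs_sparsity, loss, save_fn):
        # The log prints the previous best first, then saves on improvement.
        if self.best is not None:
            print(f"Best eval score so far: {self.best[0]:.4f} "
                  f"@ step {self.best[1]} epoch {self.best[2]:.2f}")
        if self.best is None or score > self.best[0]:
            self.best = (score, step, epoch)
            save_fn()  # e.g. write the model to the output directory
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")
```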
loss: 0.512911, lagrangian_loss: 0.008754, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:42:20
Evaluating: accuracy: 0.8982, eval_loss: 0.5212, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 162000
lambda_1: -1.8915, lambda_2: 935.5117 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.46 0.36 0.29 0.28 0.35 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110010000000000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8967 @ step 159000 epoch 13.98
Saving the best model so far: [Epoch 14 | Step: 162000 | MACs sparsity: 0.6649 | Score: 0.8982 | Loss: 0.5212]
loss: 0.050625, lagrangian_loss: -0.000414, attention_score_distillation_loss: 0.000010
loss: 0.146055, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:57:24
Evaluating: accuracy: 0.8962, eval_loss: 0.5215, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 165000
lambda_1: -1.1926, lambda_2: 953.1530 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.46 0.35 0.28 0.29 0.4 0.94]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110010000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000001110010101001000010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8982 @ step 162000 epoch 14.25
loss: 0.296017, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000010
loss: 0.034512, lagrangian_loss: 0.001690, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:11:51
Evaluating: accuracy: 0.8993, eval_loss: 0.518, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 168000
lambda_1: -1.6172, lambda_2: 969.9517 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.29 0.46 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010110100100
10111111111111111011100110000000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8982 @ step 162000 epoch 14.25
Saving the best model so far: [Epoch 14 | Step: 168000 | MACs sparsity: 0.6649 | Score: 0.8993 | Loss: 0.518]
loss: 0.390874, lagrangian_loss: -0.000075, attention_score_distillation_loss: 0.000010
loss: 0.030909, lagrangian_loss: 0.000422, attention_score_distillation_loss: 0.000010
ETA: 22:48:01 | Epoch 14 finished. Took 3219.6 seconds.
----------------------------------------------------------------------
time: 2023-07-20 04:27:23
Evaluating: accuracy: 0.8982, eval_loss: 0.5322, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 171000
lambda_1: -2.3461, lambda_2: 987.2769 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.29 0.41 0.96]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110010000000000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000000110010101001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8993 @ step 168000 epoch 14.77
loss: 0.539738, lagrangian_loss: -0.001098, attention_score_distillation_loss: 0.000010
loss: 0.386357, lagrangian_loss: -0.000066, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:41:51
Evaluating: accuracy: 0.8994, eval_loss: 0.544, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 174000
lambda_1: -1.4821, lambda_2: 1004.1932 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.29 0.28 0.42 0.96]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000001000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001100010000010100000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8993 @ step 168000 epoch 14.77
Saving the best model so far: [Epoch 15 | Step: 174000 | MACs sparsity: 0.6649 | Score: 0.8994 | Loss: 0.544]
loss: 0.404674, lagrangian_loss: -0.000497, attention_score_distillation_loss: 0.000010
loss: 0.028135, lagrangian_loss: 0.000778, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:57:06
Evaluating: accuracy: 0.8991, eval_loss: 0.5316, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 177000
lambda_1: -2.1925, lambda_2: 1022.0064 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.28 0.39 0.95]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000000000001000
10001111110110101011011010011100011001000000000000
10000011110010101011010010001100010000010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.224978, lagrangian_loss: -0.000510, attention_score_distillation_loss: 0.000010
loss: 0.031553, lagrangian_loss: 0.004469, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:11:35
Evaluating: accuracy: 0.8966, eval_loss: 0.5306, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 180000
lambda_1: -1.2492, lambda_2: 1039.1138 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.28 0.37 0.95]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001100010000011000000000
10000000110010001001000010001100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.048526, lagrangian_loss: -0.000178, attention_score_distillation_loss: 0.000010
loss: 0.336060, lagrangian_loss: 0.002401, attention_score_distillation_loss: 0.000010
ETA: 21:55:47 | Epoch 15 finished. Took 3382.28 seconds.
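The ETA line drops by roughly one epoch's wall time per epoch (each epoch takes about 3,100-3,500 seconds here). One plausible reconstruction, assuming ETA = remaining epochs x a per-epoch duration estimate; the actual estimator is not shown in the log, so the value below only approximates the logged 21:55:47:

```python
import datetime

def eta_string(epochs_done: int, total_epochs: int, sec_per_epoch: float) -> str:
    # Assumed estimator: remaining epochs times a (possibly smoothed)
    # per-epoch duration, rendered the way timedelta prints, which matches
    # the log's "ETA: 21:55:47" and "ETA: 1 day, 2:21:12" formats.
    remaining = total_epochs - epochs_done
    return str(datetime.timedelta(seconds=int(remaining * sec_per_epoch)))

# After epoch 15 of 40 at ~3382 s/epoch -> "22:32:54", near the logged value.
print(f"ETA: {eta_string(16, 40, 3382.28)} | Epoch 15 finished.")
```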
----------------------------------------------------------------------
time: 2023-07-20 05:26:02
Evaluating: accuracy: 0.8983, eval_loss: 0.5317, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 183000
lambda_1: -6.2507, lambda_2: 1056.3933 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.41 0.72]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000010000000000
10001111110110101011011010011100010001100000000000
10000011110010101011010010001100010000010010000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.282242, lagrangian_loss: -0.003303, attention_score_distillation_loss: 0.000010
loss: 0.030904, lagrangian_loss: -0.003420, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:40:30
Evaluating: accuracy: 0.898, eval_loss: 0.5156, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 186000
lambda_1: -3.1781, lambda_2: 1073.7429 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.36 0.61]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010101100100
10111111111111111011100110000000000001000000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010000010100000000
10000010110010001001000010000101010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.012500, lagrangian_loss: 0.008274, attention_score_distillation_loss: 0.000010
loss: 0.026993, lagrangian_loss: 0.001338, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:54:58
Evaluating: accuracy: 0.8981, eval_loss: 0.5547, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6594, expected_sparsity: 0.6487, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 189000
lambda_1: -1.8263, lambda_2: 1090.6847 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.39 0.73]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000010000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.015875, lagrangian_loss: 0.002947, attention_score_distillation_loss: 0.000010
loss: 0.019508, lagrangian_loss: -0.001077, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:09:28
Evaluating: accuracy: 0.8984, eval_loss: 0.5226, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6594, expected_sparsity: 0.6487, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 192000
lambda_1: -1.8976, lambda_2: 1107.8623 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.32 0.63]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000000100
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.323650, lagrangian_loss: 0.006087, attention_score_distillation_loss: 0.000010
ETA: 21:01:50 | Epoch 16 finished. Took 3328.64 seconds.
loss: 0.019672, lagrangian_loss: 0.001162, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:23:42
Evaluating: accuracy: 0.8984, eval_loss: 0.5406, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6488, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 195000
lambda_1: -2.2479, lambda_2: 1124.6880 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.27 0.29 0.44]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000010000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.380577, lagrangian_loss: 0.002817, attention_score_distillation_loss: 0.000010
loss: 0.834360, lagrangian_loss: -0.001071, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:38:02
Evaluating: accuracy: 0.8987, eval_loss: 0.5499, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6488, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 198000
lambda_1: -1.3839, lambda_2: 1142.3954 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.29 0.53]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111111111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000001000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010100010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.017833, lagrangian_loss: 0.009395, attention_score_distillation_loss: 0.000010
loss: 0.299865, lagrangian_loss: 0.005452, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:52:23
Evaluating: accuracy: 0.8995, eval_loss: 0.5661, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 201000
lambda_1: -2.4810, lambda_2: 1159.6271 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.31 0.42]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111111111010010100100100
10111111111111111011110110000000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010000010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
Saving the best model so far: [Epoch 17 | Step: 201000 | MACs sparsity: 0.6594 | Score: 0.8995 | Loss: 0.5661]
loss: 0.365697, lagrangian_loss: 0.001115, attention_score_distillation_loss: 0.000010
loss: 0.017773, lagrangian_loss: 0.001056, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:07:14
Evaluating: accuracy: 0.8999, eval_loss: 0.543, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 204000
lambda_1: -1.5292, lambda_2: 1176.3229 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.32 0.42]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000100
11101111110011111111010110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010000010100000000
10000010110010001001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8995 @ step 201000 epoch 17.68
Saving the best model so far: [Epoch 17 | Step: 204000 | MACs sparsity: 0.6594 | Score: 0.8999 | Loss: 0.543]
loss: 0.493361, lagrangian_loss: 0.008221, attention_score_distillation_loss: 0.000010
ETA: 20:08:46 | Epoch 17 finished. Took 3379.56 seconds.
loss: 0.029668, lagrangian_loss: 0.001050, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:22:15
Evaluating: accuracy: 0.9009, eval_loss: 0.548, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 207000
lambda_1: -1.3423, lambda_2: 1193.8641 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.33 0.28 0.28 0.36 0.46]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.26, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000010000
10101111110011111111010110111101111011010100100100
10111111111111111111100110000000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010000010100000000
10000001110010001001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.8999 @ step 204000 epoch 17.94
Saving the best model so far: [Epoch 18 | Step: 207000 | MACs sparsity: 0.6594 | Score: 0.9009 | Loss: 0.548]
loss: 0.022745, lagrangian_loss: 0.000888, attention_score_distillation_loss: 0.000010
loss: 0.026422, lagrangian_loss: 0.004357, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:38:45
Evaluating: accuracy: 0.9012, eval_loss: 0.5363, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 210000
lambda_1: -0.9608, lambda_2: 1211.0245 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.38 0.48]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110101000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110001000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010000010000000000
10000001110010001001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9009 @ step 207000 epoch 18.20
Saving the best model so far: [Epoch 18 | Step: 210000 | MACs sparsity: 0.6594 | Score: 0.9012 | Loss: 0.5363]
loss: 0.459285, lagrangian_loss: 0.000205, attention_score_distillation_loss: 0.000010
loss: 0.036423, lagrangian_loss: 0.000209, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:53:18
Evaluating: accuracy: 0.902, eval_loss: 0.5128, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 213000
lambda_1: -1.1285, lambda_2: 1227.9618 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.43 0.53]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110100000000000000000000000
10001111110110101011011010011100010001000010000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9012 @ step 210000 epoch 18.47
Saving the best model so far: [Epoch 18 | Step: 213000 | MACs sparsity: 0.6594 | Score: 0.902 | Loss: 0.5128]
loss: 0.173927, lagrangian_loss: -0.000254, attention_score_distillation_loss: 0.000010
loss: 0.348842, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:08:18
Evaluating: accuracy: 0.9013, eval_loss: 0.5322, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 216000
lambda_1: -1.2842, lambda_2: 1245.1265 lambda_3: 0.0000
train remain: [0.73 0.65 0.45 0.44 0.32 0.28 0.28 0.42 0.45]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9020 @ step 213000 epoch 18.73
loss: 0.034287, lagrangian_loss: 0.002586, attention_score_distillation_loss: 0.000010
ETA: 19:17:13 | Epoch 18 finished. Took 3480.99 seconds.
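Reading the mask blocks: each 0/1 row is a hard keep/drop decision over the bin_num=50 token bins of one pruned layer, and "infer remain" is simply the kept fraction per row (the all-ones rows correspond to layers whose token_prune_loc entry is still False). A small illustrative helper:

```python
def infer_remain_from_masks(mask_rows):
    # Each printed row is a 50-character 0/1 string: the hard keep/drop
    # decision per token bin at one pruned layer. "infer remain" is the
    # kept fraction per row, rounded as in the log.
    return [round(row.count("1") / len(row), 2) for row in mask_rows]

rows = [
    "1" * 50,              # an unpruned layer prints all ones -> 1.0
    "1" * 12 + "0" * 38,   # hypothetical row with 12 kept bins -> 0.24
]
print(infer_remain_from_masks(rows))  # [1.0, 0.24]
```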
loss: 0.124186, lagrangian_loss: 0.012397, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:22:30
Evaluating: accuracy: 0.903, eval_loss: 0.5544, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 219000
lambda_1: -1.7371, lambda_2: 1261.9095 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.45 0.43]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9020 @ step 213000 epoch 18.73
Saving the best model so far: [Epoch 19 | Step: 219000 | MACs sparsity: 0.6594 | Score: 0.903 | Loss: 0.5544]
loss: 0.016656, lagrangian_loss: 0.006842, attention_score_distillation_loss: 0.000010
loss: 0.191987, lagrangian_loss: 0.001296, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:36:59
Evaluating: accuracy: 0.9022, eval_loss: 0.5016, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 222000
lambda_1: -1.1584, lambda_2: 1279.1924 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.51 0.33]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000100000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.053421, lagrangian_loss: 0.000117, attention_score_distillation_loss: 0.000010
loss: 0.015838, lagrangian_loss: 0.002198, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:51:08
Evaluating: accuracy: 0.9007, eval_loss: 0.5236, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 225000
lambda_1: -1.1455, lambda_2: 1296.5518 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.61 0.3 ]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.015111, lagrangian_loss: 0.001796, attention_score_distillation_loss: 0.000010
loss: 0.012449, lagrangian_loss: -0.000120, attention_score_distillation_loss: 0.000010
ETA: 18:18:23 | Epoch 19 finished. Took 3082.5 seconds.
----------------------------------------------------------------------
time: 2023-07-20 09:05:26
Evaluating: accuracy: 0.9014, eval_loss: 0.5296, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 228000
lambda_1: -2.2491, lambda_2: 1313.6938 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.28 0.49 0.3 ]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111011010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010100000000
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.009079, lagrangian_loss: -0.000901, attention_score_distillation_loss: 0.000010
loss: 0.015812, lagrangian_loss: 0.008045, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:19:39
Evaluating: accuracy: 0.9026, eval_loss: 0.5065, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 231000
lambda_1: -0.7011, lambda_2: 1331.4109 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.27 0.4 0.27]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111111111010010100100100
10111111111111111011110110000000000000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010001010101000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.565751, lagrangian_loss: -0.000060, attention_score_distillation_loss: 0.000010
loss: 0.011595, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:33:56
Evaluating: accuracy: 0.9017, eval_loss: 0.5104, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 234000
lambda_1: -1.2062, lambda_2: 1348.4353 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.27 0.52 0.28]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000100000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.083085, lagrangian_loss: -0.000266, attention_score_distillation_loss: 0.000010
loss: 0.012467, lagrangian_loss: 0.006103, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:48:13
Evaluating: accuracy: 0.9017, eval_loss: 0.526, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 237000
lambda_1: -1.0935, lambda_2: 1365.8163 lambda_3: 0.0000
train remain: [0.73 0.66 0.44 0.43 0.32 0.28 0.27 0.52 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100100000000
10101111110011111111010110111101111010010100100110
10111111111111111011110110000000000000000000000000
10000111110110101011011010011100010001010000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.017863, lagrangian_loss: 0.005429, attention_score_distillation_loss: 0.000010
loss: 0.011142, lagrangian_loss: 0.000162, attention_score_distillation_loss: 0.000010
ETA: 17:23:18 | Epoch 20 finished. Took 3285.07 seconds.
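A convenience sketch for anyone post-processing a log like this: the eval records are regular enough to scrape (step, accuracy) pairs, e.g. to plot the climb from 0.8865 at step 123000 toward the 0.90+ scores in this stretch. Only the record layout shown above is assumed:

```python
import re

# Pull (step, accuracy) pairs out of the raw log text. re.S lets the
# non-greedy gap span the token_prune_loc / sparsity fields between the
# accuracy value and the step number within one record.
EVAL = re.compile(r"Evaluating: accuracy: ([0-9.]+),.*?step: (\d+)", re.S)

def eval_points(log_text: str):
    return [(int(step), float(acc)) for acc, step in EVAL.findall(log_text)]

# Usage: eval_points(open("train.log").read()) -> [(123000, 0.8865), ...]
```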
----------------------------------------------------------------------
time: 2023-07-20 10:02:25
Evaluating: accuracy: 0.9028, eval_loss: 0.5382, token_prune_loc: [True, True, True, True, True, True, True, False, True], macs_sparsity: 0.6594, expected_sparsity: 0.6493, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 240000
lambda_1: -0.9370, lambda_2: 1383.4458 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.63 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 1.0, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100010000000
10101111110011111111011110111101111010010100100100
10111111111111111011110110000000000000000000000000
10001111110110101011011010011100010001000000000000
10000011110010101001010010001100010001010000000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
11111111111111111111111111111111111111111111111111
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.007702, lagrangian_loss: 0.008356, attention_score_distillation_loss: 0.000010
loss: 0.266675, lagrangian_loss: 0.003927, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:16:39
Evaluating: accuracy: 0.9044, eval_loss: 0.5303, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 243000
lambda_1: -1.2799, lambda_2: 1400.5848 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.59 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
11101111110011111111010110111101111010010100100100
10111111111111111011100110000000000100000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
Saving the best model so far: [Epoch 21 | Step: 243000 | MACs sparsity: 0.6594 | Score: 0.9044 | Loss: 0.5303]
loss: 0.124508, lagrangian_loss: 0.000145, attention_score_distillation_loss: 0.000010
loss: 0.090828, lagrangian_loss: 0.001139, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:31:09
Evaluating: accuracy: 0.9021, eval_loss: 0.5192, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 246000
lambda_1: -1.1477, lambda_2: 1417.9023 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.28 0.51 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110011111111011110111101111010010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.021891, lagrangian_loss: 0.001887, attention_score_distillation_loss: 0.000010
loss: 0.013079, lagrangian_loss: 0.031589, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:45:19
Evaluating: accuracy: 0.903, eval_loss: 0.515, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 249000
lambda_1: -0.8239, lambda_2: 1435.4276 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.42 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111101111010010100100110
10111111111111111111100110000000000000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.013151, lagrangian_loss: 0.001103, attention_score_distillation_loss: 0.000010
ETA: 16:28:20 | Epoch 21 finished. Took 3289.8 seconds.
loss: 0.067399, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:59:26
Evaluating: accuracy: 0.9034, eval_loss: 0.5368, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 252000
lambda_1: -0.9058, lambda_2: 1452.7882 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.37 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110111111111010110111101111010010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010001000010000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.219977, lagrangian_loss: 0.003928, attention_score_distillation_loss: 0.000010
loss: 0.248259, lagrangian_loss: -0.000301, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:13:34
Evaluating: accuracy: 0.9031, eval_loss: 0.5162, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 255000
lambda_1: -0.4067, lambda_2: 1470.0902 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.34 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111011010100100100
10111111111111111011100110001000000000000000000000
10001111110110101011011010011100010001000000000000
10000001110010101001010010001100010101010000000000
10000001110010001011000010000100010001010100000000
10000000110010001001000010010100010001010100000001
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.014459, lagrangian_loss: 0.002309, attention_score_distillation_loss: 0.000010
loss: 0.347549, lagrangian_loss: 0.010246, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:27:42
Evaluating: accuracy: 0.9051, eval_loss: 0.5354, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 258000
lambda_1: -1.1459, lambda_2: 1487.8198 lambda_3: 0.0000
train remain: [0.74 0.65 0.44 0.43 0.32 0.28 0.28 0.35 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000000010000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
Saving the best model so far: [Epoch 22 | Step: 258000 | MACs sparsity: 0.6594 | Score: 0.9051 | Loss: 0.5354]
loss: 0.076614, lagrangian_loss: 0.000602, attention_score_distillation_loss: 0.000010
loss: 0.013523, lagrangian_loss: 0.001096, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:42:15
Evaluating: accuracy: 0.9042, eval_loss: 0.5124, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 261000
lambda_1: -1.0278, lambda_2: 1504.6995 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.45 0.26]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000010000
10000111110110101011011010011101010101000000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.040849, lagrangian_loss: 0.001851, attention_score_distillation_loss: 0.000010
ETA: 15:33:17 | Epoch 22 finished. Took 3283.32 seconds.
loss: 0.097342, lagrangian_loss: 0.009542, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:56:25
Evaluating: accuracy: 0.9032, eval_loss: 0.547, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 264000
lambda_1: -0.6112, lambda_2: 1521.8701 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.29 0.51 0.25]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110101000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000000000000000000100000
10000111110110101011011010011101010101000000000000
10000011110010101001010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.039525, lagrangian_loss: 0.003719, attention_score_distillation_loss: 0.000010
loss: 0.004788, lagrangian_loss: 0.004833, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:10:34
Evaluating: accuracy: 0.9022, eval_loss: 0.5393, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 267000
lambda_1: -0.9719, lambda_2: 1540.0824 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.42 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000000000100000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010001100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.079875, lagrangian_loss: -0.000143, attention_score_distillation_loss: 0.000010
loss: 0.336385, lagrangian_loss: 0.002171, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:24:41
Evaluating: accuracy: 0.9033, eval_loss: 0.552, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 270000
lambda_1: -0.7755, lambda_2: 1557.3724 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.42 0.25]
infer remain: [0.74, 0.66, 0.42, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000000000000000
10000111110110101011011010011100010101000001000000
10000001110010101001010010001100010001010010000000
10000001110010001001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.024105, lagrangian_loss: 0.001687, attention_score_distillation_loss: 0.000010
loss: 0.007869, lagrangian_loss: 0.000499, attention_score_distillation_loss: 0.000010
ETA: 14:35:41 | Epoch 23 finished. Took 3050.79 seconds.
----------------------------------------------------------------------
time: 2023-07-20 12:38:47
Evaluating: accuracy: 0.9029, eval_loss: 0.5254, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 273000
lambda_1: -0.4702, lambda_2: 1574.4080 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.39 0.24]
infer remain: [0.74, 0.66, 0.42, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100010000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000000000
10000111110110101011011010011100010101000001000000
10000001110010101001010010001100010001010010000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.389616, lagrangian_loss: 0.009942, attention_score_distillation_loss: 0.000010
loss: 0.013081, lagrangian_loss: 0.005521, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:52:58
Evaluating: accuracy: 0.9037, eval_loss: 0.5323, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 276000
lambda_1: -0.9596, lambda_2: 1591.1188 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.27 0.32 0.39 0.24]
infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10111111110011111111010110111101111010010100100100
10111111111111111101100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.016249, lagrangian_loss: 0.000320, attention_score_distillation_loss: 0.000010
loss: 0.135135, lagrangian_loss: 0.004868, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 13:07:04
Evaluating: accuracy: 0.9043, eval_loss: 0.5183, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 279000
lambda_1: -0.6309, lambda_2: 1608.5593 lambda_3: 0.0000
train remain:
[0.74 0.66 0.43 0.43 0.32 0.27 0.31 0.36 0.24] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10111111110011111111010110111101111010010100100100 10111111111111111001100110000000000000000010000000 10000111110110101011011010011100010101000000000000 10000011110010101001010010001100010001010000000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.007925, lagrangian_loss: 0.001710, attention_score_distillation_loss: 0.000010 loss: 0.025356, lagrangian_loss: 0.005295, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 13:21:13 Evaluating: accuracy: 0.904, eval_loss: 0.5355, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 282000 lambda_1: -1.5243, lambda_2: 1626.1809 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.33 0.39 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111110010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001011000000000 10000000110110001001000010000101010001010100000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.299053, lagrangian_loss: 0.000737, attention_score_distillation_loss: 0.000010 loss: 0.703120, lagrangian_loss: 0.000250, attention_score_distillation_loss: 0.000010 ETA: 13:40:42 | Epoch 24 finished. Took 3259.07 seconds. 
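Note on the lagrangian_loss values printed throughout: they are consistent with the usual two-multiplier Lagrangian relaxation used for L0 pruning, where the trained multipliers lambda_1 and lambda_2 scale the linear and quadratic gap between expected and target sparsity. A minimal sketch under that assumption (names are illustrative, not this repository's actual API):

    # Sketch of the penalty implied by the printed fields; assumes the
    # standard Lagrangian L0 formulation, not code from this repository.
    def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap * gap

    # With the step-249000 numbers (gap = 0.6495 - 0.65 = -0.0005,
    # lambda_1 = -0.8239, lambda_2 = 1435.4276) this gives ~0.0008,
    # the same order of magnitude as the per-batch values logged there.

This would also explain why lambda_2 keeps growing (from ~1435 to ~2300 over these epochs) while the penalty itself hovers near zero: the multipliers are trained adversarially to hold expected sparsity at the 0.65 target.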
---------------------------------------------------------------------- time: 2023-07-20 13:35:22 Evaluating: accuracy: 0.9049, eval_loss: 0.5242, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 285000 lambda_1: -0.2742, lambda_2: 1643.1600 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.45 0.25] infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000100000 10101111110011111111010110111101111011010100100100 10111111111111111001100110000000000001010000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001011000000000 10000000110010001001000010000100010001010111000000 10000000110010001001000010000100010001010110000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.015952, lagrangian_loss: 0.000811, attention_score_distillation_loss: 0.000010 loss: 0.018504, lagrangian_loss: 0.011957, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 13:49:31 Evaluating: accuracy: 0.9041, eval_loss: 0.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 288000 lambda_1: -0.9049, lambda_2: 1660.4192 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.44 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100001000000 10101111110111111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010111100010001000000000000 10000001110010101001010010001100010101010000000000 10000000110010001001000010000100010101010100000001 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.009474, lagrangian_loss: 0.003730, attention_score_distillation_loss: 0.000010 loss: 0.015891, lagrangian_loss: 0.001593, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:03:43 Evaluating: accuracy: 0.9046, eval_loss: 0.5021, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 291000 lambda_1: -0.6821, lambda_2: 1677.4827 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.31 0.28 0.3 0.5 0.24] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110101000000000 10101111110011111111010110111101111010010100101100 10111111111111111001100110000000000000000100000000 10000111110110101011011010011100010001000010000000 10000001110010101001010010001100010001010100000000 10000000110010001001010010000100010001010100000001 
10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.174100, lagrangian_loss: 0.028944, attention_score_distillation_loss: 0.000010 loss: 0.008187, lagrangian_loss: 0.002347, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:17:51 Evaluating: accuracy: 0.9044, eval_loss: 0.5215, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 294000 lambda_1: -0.5183, lambda_2: 1694.6652 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.46 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111001110110000000000000000000000000 10000111110110101011011010011100010101000000000000 10000001110010101001010010001100010001010010000000 10000001110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.016532, lagrangian_loss: 0.003743, attention_score_distillation_loss: 0.000010 loss: 0.023074, lagrangian_loss: 0.001229, attention_score_distillation_loss: 0.000010 ETA: 12:45:50 | Epoch 25 finished. Took 3264.97 seconds. ---------------------------------------------------------------------- time: 2023-07-20 14:32:01 Evaluating: accuracy: 0.9055, eval_loss: 0.4988, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 297000 lambda_1: -0.8036, lambda_2: 1711.3834 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.31 0.27 0.36 0.56 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110101000000000 10101111110111111111010110111101111010010100100100 10111111111111111001110110000000000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 Saving the best model so far: [Epoch 26 | Step: 297000 | MACs sparsity: 0.6594 | Score: 0.9055 | Loss: 0.4988] loss: 0.017057, lagrangian_loss: 0.002419, attention_score_distillation_loss: 0.000010 loss: 0.013105, lagrangian_loss: 0.003510, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:46:45 Evaluating: accuracy: 0.9042, eval_loss: 0.5095, token_prune_loc: [True, True, True, True, True, True, True, False, True], macs_sparsity: 0.6594, expected_sparsity: 0.6507, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 300000 lambda_1: -0.3759, lambda_2: 1729.3628 lambda_3: 0.0000 
train remain: [0.74 0.66 0.43 0.43 0.32 0.27 0.4 0.64 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 1.0, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10101111110111111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 11111111111111111111111111111111111111111111111111 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9055 @ step 297000 epoch 26.12 loss: 0.407412, lagrangian_loss: 0.002169, attention_score_distillation_loss: 0.000010 loss: 0.058278, lagrangian_loss: 0.004591, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:00:59 Evaluating: accuracy: 0.9059, eval_loss: 0.5163, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 303000 lambda_1: -1.1274, lambda_2: 1746.3228 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.43 0.31 0.26 0.39 0.49 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9055 @ step 297000 epoch 26.12 Saving the best model so far: [Epoch 26 | Step: 303000 | MACs sparsity: 0.6594 | Score: 0.9059 | Loss: 0.5163] loss: 0.231925, lagrangian_loss: 0.000309, attention_score_distillation_loss: 0.000010 loss: 0.284112, lagrangian_loss: 0.019746, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:15:54 Evaluating: accuracy: 0.9043, eval_loss: 0.5117, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 306000 lambda_1: -0.6597, lambda_2: 1763.6797 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.43 0.32 0.27 0.32 0.34 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100100000000 10101111111011111111010110111101111010010100100100 10111111111111111000100110000000001010000000000000 10000111110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9059 @ step 303000 epoch 26.65 loss: 0.010959, lagrangian_loss: 0.000706, attention_score_distillation_loss: 0.000010 ETA: 11:51:39 | Epoch 26 
finished. Took 3348.45 seconds. loss: 0.005960, lagrangian_loss: 0.002558, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:30:09 Evaluating: accuracy: 0.906, eval_loss: 0.5142, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 309000 lambda_1: -1.8645, lambda_2: 1780.6941 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.31 0.26 0.32 0.4 0.24] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010110111101111110010100100100 10111111111111111000100110000000001000000000000000 10000111110110101011011010011101010001000000000000 10000011110010101001010010001100010000010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9059 @ step 303000 epoch 26.65 Saving the best model so far: [Epoch 27 | Step: 309000 | MACs sparsity: 0.6622 | Score: 0.906 | Loss: 0.5142] loss: 0.013322, lagrangian_loss: 0.012124, attention_score_distillation_loss: 0.000010 loss: 0.029035, lagrangian_loss: -0.000044, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:44:50 Evaluating: accuracy: 0.9045, eval_loss: 0.5212, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 312000 lambda_1: -1.0970, lambda_2: 1797.9958 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.31 0.26 0.36 0.35 0.25] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111111100000000000 10101111110011111111010110111101111010011100100100 10111111111111111000100110001000000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001101010001010000000000 10000010110010001001000010000101010001010000000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.012972, lagrangian_loss: 0.000404, attention_score_distillation_loss: 0.000010 loss: 0.004734, lagrangian_loss: 0.009444, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:58:59 Evaluating: accuracy: 0.9055, eval_loss: 0.5199, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 315000 lambda_1: -0.7523, lambda_2: 1815.5365 lambda_3: 0.0000 train remain: [0.76 0.66 0.41 0.42 0.31 0.26 0.37 0.36 0.25] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000001000 10101111110011111111010110111101111110010100100100 
10111111111111111000101110000000000000000000000000 10000111110110101011011010011100010001010000000000 10000001110010101011010010001100010001010000000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.005888, lagrangian_loss: 0.005344, attention_score_distillation_loss: 0.000010 loss: 0.011883, lagrangian_loss: 0.010148, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:13:09 Evaluating: accuracy: 0.9048, eval_loss: 0.5194, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 318000 lambda_1: -0.1479, lambda_2: 1832.5874 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.32 0.26 0.39 0.38 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111000100110000100001000000000000000 10000111110110101011011010011100010001010000000000 10000001110010101001010010001100010100010100000000 10000001110010001001000010000100010001010100000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.012396, lagrangian_loss: 0.000386, attention_score_distillation_loss: 0.000010 ETA: 10:57:02 | Epoch 27 finished. Took 3300.38 seconds. 
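The layerwise remain vector appears to be the running product of infer remain across the nine pruned layers, with 1.0 entries for the three leading unpruned layers. A quick check against the step-318000 block above (my reconstruction, not the logging code):

    from itertools import accumulate
    from operator import mul

    # infer remain as printed at step 318000
    infer_remain = [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24]
    # three unpruned leading layers, then the cumulative product
    layerwise = [1.0, 1.0, 1.0] + [round(x, 2) for x in accumulate(infer_remain, mul)]
    print(layerwise)
    # [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]

which reproduces the logged vector exactly, i.e. the expected fraction of the sequence surviving to each layer.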
loss: 0.111784, lagrangian_loss: 0.001690, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:27:24 Evaluating: accuracy: 0.9053, eval_loss: 0.5271, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 321000 lambda_1: -0.3673, lambda_2: 1849.7450 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.32 0.26 0.37 0.36 0.26] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000010000 10101111110011111111010110111101111110010100100100 10111111111111111000100110001000000000000000000000 10000111110110101011011010011100010001000100000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.359683, lagrangian_loss: 0.014730, attention_score_distillation_loss: 0.000010 loss: 0.254084, lagrangian_loss: 0.002681, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:41:37 Evaluating: accuracy: 0.9052, eval_loss: 0.5209, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 324000 lambda_1: -0.5201, lambda_2: 1866.8728 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.42 0.32 0.26 0.34 0.34 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111000100110001000000000100000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.007427, lagrangian_loss: 0.000878, attention_score_distillation_loss: 0.000010 loss: 0.014517, lagrangian_loss: 0.001718, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:55:53 Evaluating: accuracy: 0.9052, eval_loss: 0.5356, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 327000 lambda_1: -0.4991, lambda_2: 1884.3604 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.36 0.38 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111010010110100100 10111111111111111000100110000000010000000000000000 10001111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 
10000000110010001001000010001100010001010100000000 10000000110010001011000010000100010001010100000000 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.203065, lagrangian_loss: 0.011325, attention_score_distillation_loss: 0.000010 loss: 0.146180, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.000010 ETA: 10:00:56 | Epoch 28 finished. Took 3072.73 seconds. ---------------------------------------------------------------------- time: 2023-07-20 17:10:06 Evaluating: accuracy: 0.9068, eval_loss: 0.5168, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 330000 lambda_1: -0.3473, lambda_2: 1902.1956 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.34 0.41 0.33] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 11101111110011111111010110111101111010010100100100 10111111111111111110100110000000000000000000000000 10000111110110101011011010011100010001000100000000 10000001110010101001010010001100010100010100000000 10000000110010001001010010000100010001010100000000 10000000110010001011000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 Saving the best model so far: [Epoch 29 | Step: 330000 | MACs sparsity: 0.6594 | Score: 0.9068 | Loss: 0.5168] loss: 0.008257, lagrangian_loss: 0.011549, attention_score_distillation_loss: 0.000010 loss: 0.016565, lagrangian_loss: 0.003160, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:24:46 Evaluating: accuracy: 0.9069, eval_loss: 0.5136, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 333000 lambda_1: -0.7132, lambda_2: 1917.9243 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.29 0.33 0.29] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111000110110000000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010000010101000000 10000000110010001001000010000100010001010000100001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9068 @ step 330000 epoch 29.02 Saving the best model so far: [Epoch 29 | Step: 333000 | MACs sparsity: 0.6622 | Score: 0.9069 | Loss: 0.5136] loss: 0.006383, lagrangian_loss: 0.003007, attention_score_distillation_loss: 0.000010 loss: 0.257783, lagrangian_loss: 0.001209, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:39:22 Evaluating: accuracy: 0.9062, eval_loss: 0.516, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, 
expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 336000 lambda_1: -1.0286, lambda_2: 1934.3567 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.31 0.26 0.31 0.28 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000010000 11101111110011111111010110111101111010010100100100 10111111111111111000100110100000000000000000000000 10000011110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010000000011 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.011127, lagrangian_loss: 0.000437, attention_score_distillation_loss: 0.000010 loss: 0.027935, lagrangian_loss: 0.001384, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:53:37 Evaluating: accuracy: 0.9061, eval_loss: 0.5259, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6513, expected_sequence_sparsity: 0.9191, target_sparsity: 0.65, step: 339000 lambda_1: -0.3205, lambda_2: 1951.8823 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.36 0.27 0.31] infer remain: [0.74, 0.66, 0.42, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110100000000000000100000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000001110010001011000010000100010001010000000000 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.010422, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000010 loss: 0.048148, lagrangian_loss: 0.000488, attention_score_distillation_loss: 0.000010 ETA: 9:06:35 | Epoch 29 finished. Took 3328.34 seconds. 
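Each 50-character row in these dumps looks like one pruned layer's binary keep/drop mask over 50 token-position bins, so the fraction of 1s in a row should reproduce that layer's infer remain entry. Checking the last row of the step-336000 block:

    # row copied verbatim from the step-336000 dump (deepest pruned layer)
    row = "10000000010010001001000010000100010001010000000011"
    print(sum(ch == "1" for ch in row) / len(row))
    # 0.24, matching the final infer remain entry for that step

The same check holds for the other rows (e.g. 37/50 = 0.74 for the first layer), which is why every remain value is a multiple of 0.02.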
---------------------------------------------------------------------- time: 2023-07-20 18:07:48 Evaluating: accuracy: 0.9068, eval_loss: 0.5195, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 342000 lambda_1: -0.3236, lambda_2: 1968.7819 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.4 0.31 0.39] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111100100110000000000100000000000000 10000011110110101011011010011101010001000000100000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.005336, lagrangian_loss: 0.001894, attention_score_distillation_loss: 0.000010 loss: 0.006858, lagrangian_loss: 0.001738, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:22:02 Evaluating: accuracy: 0.9067, eval_loss: 0.5062, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 345000 lambda_1: -0.2562, lambda_2: 1985.3673 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.34 0.31 0.37] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111011110111101111010010100100100 10111111111111111000100110000100000100000000000000 10000011110110101011011010011100011011000000000000 10000001110010101001010010001100010100010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.043365, lagrangian_loss: 0.002136, attention_score_distillation_loss: 0.000010 loss: 0.011090, lagrangian_loss: 0.000438, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:36:21 Evaluating: accuracy: 0.9062, eval_loss: 0.513, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 348000 lambda_1: -0.3346, lambda_2: 2002.7759 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.32 0.33 0.39] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111011110111101111010010100100100 10111111111111111000100110000000000000000000000100 10000111110110101011011010011100010001000000000000 10000011110010101001010010001100010000010100000000 10000000110010001001010010000100010001010100000000 
10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.007361, lagrangian_loss: 0.004158, attention_score_distillation_loss: 0.000010 loss: 0.007597, lagrangian_loss: 0.000635, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:50:34 Evaluating: accuracy: 0.9061, eval_loss: 0.5121, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 351000 lambda_1: -0.5228, lambda_2: 2020.1655 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.31 0.33 0.38] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000000000000000010000000 10000111110110101011011010011100010101000000000000 10000001110010101001010010001100010100010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.011360, lagrangian_loss: 0.006206, attention_score_distillation_loss: 0.000010 loss: 0.023159, lagrangian_loss: 0.039438, attention_score_distillation_loss: 0.000010 ETA: 8:11:57 | Epoch 30 finished. Took 3284.77 seconds. ---------------------------------------------------------------------- time: 2023-07-20 19:04:47 Evaluating: accuracy: 0.9064, eval_loss: 0.5295, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 354000 lambda_1: -0.7080, lambda_2: 2037.1526 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.28 0.32 0.34] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000000000000000010000000 10000011110110101011011010011100010101000100000000 10000001110010101001010010001100010001010100000000 10000001110010001001000010000100010001010100000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.010760, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000010 loss: 0.017829, lagrangian_loss: 0.003529, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:19:00 Evaluating: accuracy: 0.9048, eval_loss: 0.5163, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 357000 lambda_1: -0.5968, lambda_2: 2054.3672 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.28 0.35 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 
0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111010011100100100 10111111111111111000100110000000100000000000000000 10000011110110101011011010011100010101000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.006646, lagrangian_loss: 0.005841, attention_score_distillation_loss: 0.000010 loss: 0.018823, lagrangian_loss: 0.003421, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:33:21 Evaluating: accuracy: 0.9069, eval_loss: 0.5159, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 360000 lambda_1: -0.6352, lambda_2: 2071.4238 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.3 0.39 0.31] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111111111010010100100100 10111111111111111100100110000000000000000000000000 10000011110110101011011010011100010101000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.368457, lagrangian_loss: 0.000409, attention_score_distillation_loss: 0.000010 loss: 0.008010, lagrangian_loss: 0.000022, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:47:38 Evaluating: accuracy: 0.9071, eval_loss: 0.5138, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 363000 lambda_1: -0.1604, lambda_2: 2088.8484 lambda_3: 0.0000 train remain: [0.74 0.67 0.41 0.41 0.33 0.26 0.3 0.34 0.28] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000100000000000000000000 10000011110110101011011010011100010101000010000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 Saving the best model so far: [Epoch 31 | Step: 363000 | MACs sparsity: 0.6622 | Score: 0.9071 | Loss: 0.5138] loss: 0.011156, lagrangian_loss: 0.000268, attention_score_distillation_loss: 0.000010 ETA: 7:17:25 | Epoch 31 finished. Took 3309.39 seconds. 
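The ordering of the "Best eval score so far" and "Saving the best model so far" lines suggests the previous best is printed before the comparison, and a checkpoint line is emitted only when the new accuracy beats it (e.g. at step 363000 the log prints best 0.9069, then saves 0.9071). A plausible reconstruction of that bookkeeping (hypothetical helper, not the repo's code):

    best = {"score": 0.0, "step": 0, "epoch": 0.0}

    def on_eval(score, step, epoch, macs_sparsity, loss):
        # The previous best is printed first, exactly as in the log.
        print(f"Best eval score so far: {best['score']:.4f} "
              f"@ step {best['step']} epoch {best['epoch']:.2f}")
        if score > best["score"]:
            best.update(score=score, step=step, epoch=epoch)
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")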
loss: 0.007335, lagrangian_loss: 0.005575, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:02:16 Evaluating: accuracy: 0.9071, eval_loss: 0.5005, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 366000 lambda_1: -0.3659, lambda_2: 2106.5913 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.36 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111000100110000000001000000000000000 10010011110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9071 @ step 363000 epoch 31.92 loss: 0.009722, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000010 loss: 0.310500, lagrangian_loss: 0.002695, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:16:34 Evaluating: accuracy: 0.9072, eval_loss: 0.5043, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 369000 lambda_1: -0.4041, lambda_2: 2122.8223 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.37 0.28] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111111011111111010110111101111010010100100100 10111111111111111000100110010000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9071 @ step 363000 epoch 31.92 Saving the best model so far: [Epoch 32 | Step: 369000 | MACs sparsity: 0.6622 | Score: 0.9072 | Loss: 0.5043] loss: 0.201731, lagrangian_loss: 0.000303, attention_score_distillation_loss: 0.000010 loss: 0.012930, lagrangian_loss: 0.001717, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:31:05 Evaluating: accuracy: 0.9068, eval_loss: 0.5058, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 372000 lambda_1: -0.4263, lambda_2: 2139.8025 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.29 0.34 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100010000000 10101111110011111111010110111101111110010100100100 
10111111111111111000110110000000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000010110010001001000010000100010001010100000000 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9072 @ step 369000 epoch 32.45 loss: 0.006624, lagrangian_loss: 0.010225, attention_score_distillation_loss: 0.000010 loss: 0.011726, lagrangian_loss: 0.000200, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:45:16 Evaluating: accuracy: 0.9066, eval_loss: 0.5218, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 375000 lambda_1: -0.6381, lambda_2: 2157.4548 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.33 0.26 0.31 0.32 0.32] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000100000 10101111110011111111010110111101111010010101100100 10111111111111111000110110000000000000000000000000 10000011110110101011011010011100010001000100000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010110000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9072 @ step 369000 epoch 32.45 loss: 0.010423, lagrangian_loss: 0.002671, attention_score_distillation_loss: 0.000010 ETA: 6:22:48 | Epoch 32 finished. Took 3299.15 seconds. 
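The ETA printed at each epoch boundary is roughly the remaining epochs times the latest epoch duration, assuming this run's 40-epoch schedule; the trainer may in fact average per-step times, so treat this as a back-of-the-envelope check only:

    def eta(epochs_done, total_epochs, last_epoch_secs):
        # naive ETA: remaining epochs at the most recent epoch duration
        remaining = int((total_epochs - epochs_done) * last_epoch_secs)
        h, rem = divmod(remaining, 3600)
        m, s = divmod(rem, 60)
        return f"{h}:{m:02d}:{s:02d}"

    print(eta(33, 40, 3299.15))
    # "6:24:54", close to the "6:22:48" logged after epoch 32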
loss: 0.005749, lagrangian_loss: 0.009991, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 20:59:29
Evaluating: accuracy: 0.9086, eval_loss: 0.4983, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 378000
lambda_1: -0.5325, lambda_2: 2174.3752
lambda_3: 0.0000
train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.32 0.28]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111001100110000000000000000000000000
10000011110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9072 @ step 369000 epoch 32.45
Saving the best model so far: [Epoch 33 | Step: 378000 | MACs sparsity: 0.6622 | Score: 0.9086 | Loss: 0.4983]
loss: 0.005285, lagrangian_loss: 0.002330, attention_score_distillation_loss: 0.000010
loss: 0.022412, lagrangian_loss: 0.000735, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:14:11
Evaluating: accuracy: 0.9071, eval_loss: 0.5111, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 381000
lambda_1: -0.7453, lambda_2: 2191.4399
lambda_3: 0.0000
train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.29 0.35 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000100
10101111110011111111110110111101111010010100100100
10111111111111111000100110000001000000000000000000
10000011110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.327011, lagrangian_loss: 0.005363, attention_score_distillation_loss: 0.000010
loss: 0.007561, lagrangian_loss: 0.004921, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:28:21
Evaluating: accuracy: 0.907, eval_loss: 0.5143, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 384000
lambda_1: -0.2550, lambda_2: 2208.1333
lambda_3: 0.0000
train remain: [0.74 0.67 0.4 0.41 0.31 0.26 0.29 0.36 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111000100110100000000000000000000000
10000011110110101011011010011100010001010000000000
10000001110010101001010010001100010000010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.006060, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000010
loss: 0.008744, lagrangian_loss: 0.004687, attention_score_distillation_loss: 0.000010
ETA: 5:27:34 | Epoch 33 finished. Took 3097.55 seconds.
----------------------------------------------------------------------
time: 2023-07-20 21:42:37
Evaluating: accuracy: 0.9064, eval_loss: 0.5209, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 387000
lambda_1: -0.2488, lambda_2: 2225.4016
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.41 0.32 0.26 0.3 0.43 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10001011110110101011011010011100010001000000000000
10000001110010101001010010001100010000010100000000
10000000110010001001010010000100010001010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.011189, lagrangian_loss: 0.000375, attention_score_distillation_loss: 0.000010
loss: 0.008280, lagrangian_loss: 0.001082, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:56:52
Evaluating: accuracy: 0.9076, eval_loss: 0.4986, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 390000
lambda_1: -0.3095, lambda_2: 2242.7463
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.4 0.31 0.26 0.3 0.41 0.28]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10000011110110101011011010011100010001000100000000
10000001110010101001010010001100010001010000000000
10000000110010001001000010000100010001010000000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.011095, lagrangian_loss: 0.000680, attention_score_distillation_loss: 0.000010
loss: 0.008105, lagrangian_loss: 0.001251, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:11:06
Evaluating: accuracy: 0.907, eval_loss: 0.5086, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 393000
lambda_1: -0.1682, lambda_2: 2260.0564
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.4 0.31 0.26 0.29 0.41 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000100110000000000000010000000000
10000011110110101011011010011100010001000001000000
10000001110010101001010010001100010001010000000000
10000000110010101001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.004605, lagrangian_loss: 0.015639, attention_score_distillation_loss: 0.000010
loss: 0.006394, lagrangian_loss: 0.011431, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:25:15
Evaluating: accuracy: 0.908, eval_loss: 0.5, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 396000
lambda_1: -0.6651, lambda_2: 2277.3845
lambda_3: 0.0000
train remain: [0.75 0.66 0.4 0.4 0.31 0.26 0.3 0.42 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111111111010010100100100
10111111111111111000100110100000000000000000000000
10000011110110101011011010011100010001000000100000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010100000001
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.020838, lagrangian_loss: 0.004873, attention_score_distillation_loss: 0.000010
loss: 0.013079, lagrangian_loss: 0.000335, attention_score_distillation_loss: 0.000010
ETA: 4:32:58 | Epoch 34 finished. Took 3274.9 seconds.
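Note on reading these blocks: the "layerwise remain" vector is consistent with a running product of the per-layer "infer remain" fractions, with layers outside prune_location (layers 0-2 here, since prune_location=[3..11]) fixed at 1.0. A minimal sketch of that relationship, using the step-387000 values above; the variable names are illustrative and not taken from the training code:

import numpy as np

# "infer remain" at step 387000, one entry per pruned layer (prune_location).
prune_location = [3, 4, 5, 6, 7, 8, 9, 10, 11]
infer_remain = [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]

remain = np.ones(12)                   # 12 encoder layers; unpruned layers keep all tokens
remain[prune_location] = infer_remain  # pruned layers keep only a fraction of token bins
layerwise = np.round(np.cumprod(remain), 2)
print(layerwise)
# -> [1. 1. 1. 0.74 0.49 0.2 0.08 0.02 0.01 0. 0. 0.], matching the logged "layerwise remain"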
----------------------------------------------------------------------
time: 2023-07-20 22:39:28
Evaluating: accuracy: 0.9074, eval_loss: 0.4995, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 399000
lambda_1: -0.8593, lambda_2: 2294.6165
lambda_3: 0.0000
train remain: [0.76 0.67 0.39 0.39 0.31 0.26 0.31 0.5 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111100100110000000000000000000000000
10000011110110101011011010011101010001000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001010010000100010001010000010000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.008430, lagrangian_loss: 0.000045, attention_score_distillation_loss: 0.000010
loss: 0.004441, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:53:40
Evaluating: accuracy: 0.9077, eval_loss: 0.5016, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 402000
lambda_1: -0.3077, lambda_2: 2311.3203
lambda_3: 0.0000
train remain: [0.75 0.67 0.39 0.4 0.31 0.26 0.32 0.41 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100101100
10111111111111111000100110000000000000000000000100
10000011110110101011011010011100010001000000010000
10000000110010101001010010001100010001010100000000
10000000110010001001000010001100010001010100000000
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.012902, lagrangian_loss: 0.009601, attention_score_distillation_loss: 0.000010
loss: 0.012333, lagrangian_loss: 0.001954, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:07:52
Evaluating: accuracy: 0.9069, eval_loss: 0.5044, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 405000
lambda_1: -0.7087, lambda_2: 2328.8462
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.31 0.26 0.32 0.37 0.25]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111111111111111110100000000000
10101111110011111111010110111101111010110100100100
10111111111111111000100110000000000000000010000000
10000011110110101011011010011100010001000010000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010000000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.015108, lagrangian_loss: 0.008105, attention_score_distillation_loss: 0.000010
loss: 0.006259, lagrangian_loss: 0.001557, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:22:03
Evaluating: accuracy: 0.9063, eval_loss: 0.4933, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 408000
lambda_1: -0.5158, lambda_2: 2345.6768
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.33 0.39 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001011000010010100010001010000000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007610, lagrangian_loss: 0.003020, attention_score_distillation_loss: 0.000010
ETA: 3:38:22 | Epoch 35 finished. Took 3274.06 seconds.
loss: 0.011378, lagrangian_loss: 0.009742, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:36:12
Evaluating: accuracy: 0.905, eval_loss: 0.4854, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 411000
lambda_1: -0.1714, lambda_2: 2362.8928
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.31 0.35 0.25]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110111111111010110111101111010010100100100
00111111111111111000100110010000100000000000000000
10000011110110101011011010001100010001010000100000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010110000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.014448, lagrangian_loss: 0.002844, attention_score_distillation_loss: 0.000010
loss: 0.005291, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:50:25
Evaluating: accuracy: 0.906, eval_loss: 0.5015, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6543, expected_sequence_sparsity: 0.9198, target_sparsity: 0.65, step: 414000
lambda_1: -0.4862, lambda_2: 2379.9822
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.28 0.32 0.26]
infer remain: [0.74, 0.66, 0.38, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010101100100
00111111111111111000100110000000001000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.005761, lagrangian_loss: 0.000847, attention_score_distillation_loss: 0.000010
loss: 0.020640, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:04:36
Evaluating: accuracy: 0.9053, eval_loss: 0.4972, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 417000
lambda_1: -0.3041, lambda_2: 2397.6174
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.27 0.32 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111011110111101111010010100100100
00111111111111111000110110000100000000000000000000
10001011110110101011011010001100010001000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010001100010001010100000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.149664, lagrangian_loss: 0.001193, attention_score_distillation_loss: 0.000010
loss: 0.005796, lagrangian_loss: 0.002503, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:18:50
Evaluating: accuracy: 0.905, eval_loss: 0.4998, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 420000
lambda_1: -0.2627, lambda_2: 2414.6377
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.28 0.32 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110110000000000
10101111110011111111010110111101111011010100100100
00111111111111111100100110000100000000000000000000
10000011110110101011011010001101010001000000000000
10000000110010101001010010001100010001010100000000
10000001110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.014705, lagrangian_loss: 0.000021, attention_score_distillation_loss: 0.000010
ETA: 2:43:46 | Epoch 36 finished. Took 3272.29 seconds.
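Note on lagrangian_loss: lambda_1 and lambda_2 above are trainable Lagrange multipliers that tie the model's expected sparsity to target_sparsity. A hedged sketch of the usual L0/CoFi-style penalty this log is consistent with; the exact objective in this script (including the role of lambda_3, which stays at 0.0000 throughout) may differ:

def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # Linear + quadratic constraint penalty; the multipliers are updated by
    # gradient ascent, so expected sparsity is pushed toward the target from
    # both sides rather than merely bounded.
    diff = expected_sparsity - target_sparsity
    return lambda_1 * diff + lambda_2 * diff * diff

# Illustrative call with the step-420000 multipliers logged above:
print(lagrangian_loss(0.6532, 0.65, -0.2627, 2414.6377))

The logged lagrangian_loss values come from per-batch expected sparsity during training, so this eval-time calculation is indicative only.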
loss: 0.009032, lagrangian_loss: 0.010806, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:33:03
Evaluating: accuracy: 0.9047, eval_loss: 0.503, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 423000
lambda_1: -0.2451, lambda_2: 2431.6763
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.31 0.26 0.27 0.29 0.26]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111110110111101111010010100100100
00111111111111111000100110000000000000000000100100
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001011000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.107755, lagrangian_loss: 0.003145, attention_score_distillation_loss: 0.000010
loss: 0.015333, lagrangian_loss: 0.001036, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:47:16
Evaluating: accuracy: 0.9048, eval_loss: 0.4962, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 426000
lambda_1: -0.3200, lambda_2: 2448.7913
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.28 0.32 0.28]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110111111111010110111101111010010100100100
00111111111111111000100110000000100010000000000000
10000011110110101011011010001101010001000000000000
10000000110010101001010010001100010001010100000000
10000001110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007883, lagrangian_loss: 0.010295, attention_score_distillation_loss: 0.000010
loss: 0.009252, lagrangian_loss: 0.000482, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:01:30
Evaluating: accuracy: 0.9046, eval_loss: 0.4996, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 429000
lambda_1: -0.2840, lambda_2: 2466.2822
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.27 0.37 0.29]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111010110100100100
00111111111111111000100110000000100010000000000000
10000011110110101011011010001101010001010000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010101010001000000
10000000010010001001010010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.009898, lagrangian_loss: 0.002330, attention_score_distillation_loss: 0.000010
loss: 0.001772, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:15:37
Evaluating: accuracy: 0.9044, eval_loss: 0.5015, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 432000
lambda_1: -0.0957, lambda_2: 2483.9102
lambda_3: 0.0000
train remain: [0.76 0.65 0.39 0.39 0.32 0.26 0.26 0.49 0.32]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111011110111101111010010100100100
00111111111111111101100110000000000000000000000000
10000011110110101011011010001100010001000101000000
10000000110010101001010010001100010001010100000000
10000000110010101001000010000100010001010100000000
10000000010010001001010010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.003926, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000010
ETA: 1:49:11 | Epoch 37 finished. Took 3272.06 seconds.
loss: 0.007299, lagrangian_loss: 0.000688, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:29:49
Evaluating: accuracy: 0.9042, eval_loss: 0.5107, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 435000
lambda_1: -0.3987, lambda_2: 2501.2861
lambda_3: 0.0000
train remain: [0.76 0.65 0.39 0.39 0.3 0.26 0.26 0.45 0.32]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111010010100100100
00111111111111111010100110000000000000000000100000
10000111110110101011011010001100010001000000000000
10000000110010101001010010001100010001010000000000
10000000110010001001000010001101010001010000000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.121294, lagrangian_loss: 0.006647, attention_score_distillation_loss: 0.000010
loss: 0.006836, lagrangian_loss: 0.000367, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:44:03
Evaluating: accuracy: 0.9053, eval_loss: 0.5002, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 438000
lambda_1: -0.3716, lambda_2: 2518.7251
lambda_3: 0.0000
train remain: [0.77 0.65 0.39 0.39 0.29 0.26 0.25 0.42 0.31]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110011111111010110111101111010010100100100
00111111111111111000110110000100000000000000000000
10000011110110101011011010001100010001000010000000
10000000110010101001010010001100010001010000000000
10000000110010001001000010001100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.005267, lagrangian_loss: 0.004462, attention_score_distillation_loss: 0.000010
loss: 0.004606, lagrangian_loss: 0.000595, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:58:12
Evaluating: accuracy: 0.9041, eval_loss: 0.5077, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 441000
lambda_1: -0.2506, lambda_2: 2535.8037
lambda_3: 0.0000
train remain: [0.77 0.65 0.4 0.39 0.29 0.26 0.25 0.45 0.33]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110011111111010110111101111010010100100100
00111111111111111000100110000101000000000000000000
10000011110110101011011010001100010001100000000000
10000000110010101001010010000100010001010000001000
10000001110010001001010010000100010001010000000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.013767, lagrangian_loss: 0.000631, attention_score_distillation_loss: 0.000010
loss: 0.013047, lagrangian_loss: 0.001207, attention_score_distillation_loss: 0.000010
ETA: 0:54:30 | Epoch 38 finished. Took 3064.35 seconds.
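Note on the 0/1 rows: each evaluation prints one 50-character row per entry of prune_location, which reads naturally as a keep/drop mask over that layer's bin_num=50 token bins; the layer's "infer remain" is then the fraction of 1s in its row. A quick check against the step-441000 block, with rows copied verbatim from the log:

# Keep masks for the first (layer 3) and last (layer 11) pruned layers, step 441000.
masks = {
    3: "10111111111111111111111110111111111110100000100000",
    11: "10000000010010001001000010000100010001010000000011",
}
for layer, mask in masks.items():
    print(layer, mask.count("1") / len(mask))
# -> 3 0.74 and 11 0.24, matching the logged "infer remain" entries for those layers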
----------------------------------------------------------------------
time: 2023-07-21 02:12:26
Evaluating: accuracy: 0.9066, eval_loss: 0.5079, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6564, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 444000
lambda_1: -0.7933, lambda_2: 2553.1509
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.28 0.26 0.24 0.39 0.27]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100001100
00111111111111111001100110000000000000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010000100010001010100000000
10000000110010001001010010000100010001010010000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007001, lagrangian_loss: 0.000182, attention_score_distillation_loss: 0.000010
loss: 0.317712, lagrangian_loss: 0.014128, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:26:37
Evaluating: accuracy: 0.9062, eval_loss: 0.498, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6564, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 447000
lambda_1: -0.5744, lambda_2: 2570.6865
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.28 0.26 0.24 0.4 0.28]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100000100
00111111111111111010100110000000000000000000000000
10000011110110101011011010001101010001000000000000
10000000110010101011010010000100010001010000000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.003147, lagrangian_loss: 0.000686, attention_score_distillation_loss: 0.000010
loss: 0.247291, lagrangian_loss: 0.001221, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:40:49
Evaluating: accuracy: 0.9072, eval_loss: 0.5071, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 450000
lambda_1: -0.3225, lambda_2: 2587.6221
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.27 0.26 0.24 0.46 0.27]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
11101111110011111111010110111101111010010100000100
00111111111111111010100110000000000000010000000000
10000011110110101011011010001100011001000000000000
10000000110010101001000010010100010001010010000000
10000000110010001001000010000100010001010100100000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.020886, lagrangian_loss: 0.001091, attention_score_distillation_loss: 0.000010
loss: 0.011388, lagrangian_loss: 0.000267, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:54:59
Evaluating: accuracy: 0.9062, eval_loss: 0.5078, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6565, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 453000
lambda_1: -0.6912, lambda_2: 2605.0227
lambda_3: 0.0000
train remain: [0.78 0.64 0.39 0.39 0.27 0.26 0.24 0.36 0.26]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.26, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100000101
00111111111111111000100110000100000000000000000000
10000011110110101011011010011100010001000000000000
10000000110010101001000010000100010001010100000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007881, lagrangian_loss: 0.006042, attention_score_distillation_loss: 0.000010
loss: 0.003655, lagrangian_loss: 0.000986, attention_score_distillation_loss: 0.000010
ETA: 0:00:00 | Epoch 39 finished. Took 3270.83 seconds.
07/21/2023 03:03:39 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=9a5a65ea-641a-4e71-bf7b-573708e6a20c&run_id=9a5a65ea-641a-4e71-bf7b-573708e6a20c
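Note on checkpointing: the "Best eval score so far" / "Saving the best model so far" lines indicate that a checkpoint is written only when eval accuracy improves on the running best (this run peaks at 0.9086 @ step 378000 and never improves afterwards, so nothing is saved again). A sketch of that bookkeeping; save_checkpoint is a hypothetical stand-in for whatever the script actually calls:

best = {"score": float("-inf"), "step": None, "epoch": None}

def on_eval(score, step, epoch, macs_sparsity, loss):
    # Report the best score seen so far, then save only on improvement.
    if best["step"] is not None:
        print(f"Best eval score so far: {best['score']} @ step {best['step']} epoch {best['epoch']}")
    if score > best["score"]:
        best.update(score=score, step=step, epoch=epoch)
        print(f"Saving the best model so far: [Epoch {int(epoch)} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
        # save_checkpoint(...)  # hypothetical; the actual save call is not shown in the log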