/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:47:58 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:47:59 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:47:59 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/sst2 to /home/aiscuser/.cache/huggingface/datasets/glue/sst2/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...

Downloading data: 100%|██████████| 7.44M/7.44M [00:00<00:00, 102MB/s]

Generating train split: 64260 examples [00:01, 35678.95 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
Dataset glue downloaded and prepared to /home/aiscuser/.cache/huggingface/datasets/glue/sst2/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353. Subsequent calls will reuse this data.

100%|██████████| 3/3 [00:00<00:00, 513.74it/s]
disable token pruning.
enable token pruning. token_prune_loc: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NOTICE: THIS IS PRUNING STAGE
max_seq_length: 256

Running tokenizer on dataset:  99%|█████████▉| 67000/67349 [00:11<00:00, 6308.59 examples/s]
Running tokenizer on dataset: 100%|██████████| 872/872 [00:00<00:00, 2704.90 examples/s]
Running tokenizer on dataset: 100%|██████████| 1821/1821 [00:00<00:00, 5386.07 examples/s]
Downloading builder script: 5.76kB [00:00, 5.12MB/s]
double check the prune location is loaded correctly: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
double check hard_token_mask: <class 'NoneType'>
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25/runs/Jul19_14-48-00_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/SST2/reproduce1/s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.4_lr1e-05_reglr0.04_alpha0.001_warmup10_bin25', pruning_type='token+pruner', reg_learning_rate=0.04, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.4, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/SST2', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.001, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=25, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:48:57
Evaluating: accuracy: 0.9323, eval_loss: 0.2955, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! using <class 'models.l0_module.L0ModuleForMAC'>, temperature: 0.67, init drop rate: 0.01 token_loga shape: [10, 25] prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.155357, lagrangian_loss: 0.004811, attention_score_distillation_loss: 0.005180
----------------------------------------------------------------------
time: 2023-07-19 14:50:27
Evaluating: accuracy: 0.9289, eval_loss: 0.3444, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6119, target_sparsity: 0.0095, step: 500
lambda_1: 1.3281, lambda_2: 9.5986 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
loss: 0.006962, lagrangian_loss: -0.008403, attention_score_distillation_loss: 0.004522
loss: 0.040866, lagrangian_loss: -0.021373, attention_score_distillation_loss: 0.006085
----------------------------------------------------------------------
time: 2023-07-19 14:51:56
Evaluating: accuracy: 0.9289, eval_loss: 0.3043, token_prune_loc: [False, False, False, False, False, False, False, False, True, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6119, target_sparsity: 0.019, step: 1000
lambda_1: -2.0315, lambda_2: 19.8813 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   1.   0.99 1.  ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
loss: 0.028482, lagrangian_loss: 0.036377, attention_score_distillation_loss: 0.004161
loss: 0.183112, lagrangian_loss: 0.003172, attention_score_distillation_loss: 0.004896
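The target_sparsity values reported in the evaluation lines so far (0.0095 at step 500, 0.019 at step 1000) are consistent with a linear ramp toward target_sparsity=0.4 over lagrangian_warmup_epochs=10. The sketch below is an assumption inferred from those numbers, not the training code itself: the function name is hypothetical, and the steps-per-epoch formula assumes 67349 training examples, a batch size of 32 on one GPU, and no dropped last batch.

```python
import math

TRAIN_EXAMPLES = 67349   # size of the tokenized training set in this log
BATCH_SIZE = 32          # per_device_train_batch_size with _n_gpu=1
WARMUP_EPOCHS = 10       # lagrangian_warmup_epochs
FINAL_SPARSITY = 0.4     # target_sparsity in AdditionalArguments


def assumed_target_sparsity(step):
    """Hypothetical linear warm-up of the sparsity target (inferred from the log)."""
    steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
    warmup_steps = WARMUP_EPOCHS * steps_per_epoch
    # Ramp linearly from 0 to FINAL_SPARSITY, then hold.
    return FINAL_SPARSITY * min(1.0, step / warmup_steps)


for step in (500, 1000):
    print(step, round(assumed_target_sparsity(step), 4))
```

If this reading is right, the target would reach 0.4 only after the tenth epoch, which is why the reported target_sparsity is still far below 0.4 in the early evaluations.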
----------------------------------------------------------------------
time: 2023-07-19 14:53:27
Evaluating: accuracy: 0.9278, eval_loss: 0.3021, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0178, expected_sparsity: 0.0169, expected_sequence_sparsity: 0.6185, target_sparsity: 0.0285, step: 1500
lambda_1: 1.0045, lambda_2: 25.0687 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.99 0.91 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110111110111
1111111111111111111111111
loss: 0.021260, lagrangian_loss: -0.000303, attention_score_distillation_loss: 0.005491
loss: 0.176446, lagrangian_loss: 0.002407, attention_score_distillation_loss: 0.004426
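Each 25-character row printed after an evaluation appears to be the keep mask for one prune location (bin_num=25, ten locations). In the step-1500 block above, the fraction of ones in each row matches infer remain, and layerwise remain matches the running product of infer remain, preceded by 1.0 for the two layers outside prune_location. The following is a sketch of that relationship as read off the log; the function names are hypothetical and the actual implementation may differ.

```python
def infer_remain(mask_rows):
    """Fraction of kept bins ('1' characters) in each mask row."""
    return [row.count("1") / len(row) for row in mask_rows]


def layerwise_remain(remain, unpruned_leading_layers=2):
    """Running product of the per-location keep ratios.

    Assumes the first `unpruned_leading_layers` layers are never pruned,
    which matches prune_location starting at layer 2 in this log.
    """
    out = [1.0] * unpruned_leading_layers
    kept = 1.0
    for ratio in remain:
        kept *= ratio
        out.append(round(kept, 2))
    return out


# Mask rows copied from the step-1500 evaluation block.
rows = ["1" * 25] * 8 + ["1111111111111110111110111", "1" * 25]
remain = infer_remain(rows)
print([round(r, 2) for r in remain])
print(layerwise_remain(remain))
```

Because the ratios multiply, a token dropped at one location stays dropped for every later layer, so the last layers see the smallest share of the sequence.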
----------------------------------------------------------------------
time: 2023-07-19 14:54:57
Evaluating: accuracy: 0.9255, eval_loss: 0.3446, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0323, expected_sparsity: 0.0301, expected_sequence_sparsity: 0.6236, target_sparsity: 0.038, step: 2000
lambda_1: 0.1692, lambda_2: 25.6928 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.99 0.88 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.96]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.84]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110111110011
1111111111111011111111111
loss: 0.008688, lagrangian_loss: 0.000906, attention_score_distillation_loss: 0.005019
ETA: 4:05:57 | Epoch 0 finished. Took 378.39 seconds.
loss: 0.029153, lagrangian_loss: 0.000659, attention_score_distillation_loss: 0.005779
----------------------------------------------------------------------
time: 2023-07-19 14:56:27
Evaluating: accuracy: 0.9243, eval_loss: 0.321, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0523, expected_sparsity: 0.0472, expected_sequence_sparsity: 0.6303, target_sparsity: 0.0475, step: 2500
lambda_1: 0.1178, lambda_2: 26.1422 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.98 0.84 0.9 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.88]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110011
1111111111111001101111111
loss: 0.225283, lagrangian_loss: 0.000981, attention_score_distillation_loss: 0.004351
loss: 0.014326, lagrangian_loss: 0.000840, attention_score_distillation_loss: 0.004253
----------------------------------------------------------------------
time: 2023-07-19 14:57:57
Evaluating: accuracy: 0.9255, eval_loss: 0.3366, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0612, expected_sparsity: 0.0592, expected_sequence_sparsity: 0.6351, target_sparsity: 0.057, step: 3000
lambda_1: 0.4644, lambda_2: 27.2006 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.98 0.82 0.87]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110001
1111111111111001101111011
loss: 0.108720, lagrangian_loss: 0.000790, attention_score_distillation_loss: 0.003481
loss: 0.080188, lagrangian_loss: 0.007772, attention_score_distillation_loss: 0.004560
----------------------------------------------------------------------
time: 2023-07-19 14:59:27
Evaluating: accuracy: 0.9266, eval_loss: 0.3362, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0612, expected_sparsity: 0.0592, expected_sequence_sparsity: 0.6351, target_sparsity: 0.0665, step: 3500
lambda_1: -1.0033, lambda_2: 30.6546 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.99 0.81 0.86]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.84]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110001
1111111111111001101111011
loss: 0.087554, lagrangian_loss: 0.006383, attention_score_distillation_loss: 0.004199
loss: 0.010215, lagrangian_loss: 0.002603, attention_score_distillation_loss: 0.003189
----------------------------------------------------------------------
time: 2023-07-19 15:00:57
Evaluating: accuracy: 0.93, eval_loss: 0.3314, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0667, expected_sparsity: 0.0635, expected_sequence_sparsity: 0.6367, target_sparsity: 0.076, step: 4000
lambda_1: -0.5176, lambda_2: 33.8813 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.99 0.81 0.78]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110001
1111111111111001101101011
loss: 0.009859, lagrangian_loss: 0.006298, attention_score_distillation_loss: 0.004333
ETA: 3:59:36 | Epoch 1 finished. Took 378.25 seconds.
loss: 0.051151, lagrangian_loss: 0.001607, attention_score_distillation_loss: 0.004309
----------------------------------------------------------------------
time: 2023-07-19 15:02:27
Evaluating: accuracy: 0.9266, eval_loss: 0.3515, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0834, expected_sparsity: 0.0761, expected_sequence_sparsity: 0.6417, target_sparsity: 0.0855, step: 4500
lambda_1: -0.0091, lambda_2: 36.8318 lambda_3: 0.0000
train remain: [1.   0.99 1.   1.   1.   1.   1.   0.99 0.79 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.54]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110001
1111111101111000001101011
loss: 0.008536, lagrangian_loss: 0.000317, attention_score_distillation_loss: 0.004327
loss: 0.036679, lagrangian_loss: 0.001312, attention_score_distillation_loss: 0.004933
----------------------------------------------------------------------
time: 2023-07-19 15:03:57
Evaluating: accuracy: 0.9266, eval_loss: 0.3407, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0923, expected_sparsity: 0.0909, expected_sequence_sparsity: 0.6475, target_sparsity: 0.095, step: 5000
lambda_1: 0.4635, lambda_2: 37.8125 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.98 0.76 0.6 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.46]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111110000
1011111101111000001001011
loss: 0.003988, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.004565
loss: 0.121249, lagrangian_loss: 0.002231, attention_score_distillation_loss: 0.003795
----------------------------------------------------------------------
time: 2023-07-19 15:05:27
Evaluating: accuracy: 0.9312, eval_loss: 0.3027, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1122, expected_sparsity: 0.1048, expected_sequence_sparsity: 0.6529, target_sparsity: 0.1045, step: 5500
lambda_1: -1.4023, lambda_2: 40.6654 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.96 0.73 0.52]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.52]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.37]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111100000
1011110001111000001001011
loss: 0.098499, lagrangian_loss: -0.004970, attention_score_distillation_loss: 0.003548
loss: 0.051845, lagrangian_loss: 0.002930, attention_score_distillation_loss: 0.003697
----------------------------------------------------------------------
time: 2023-07-19 15:06:57
Evaluating: accuracy: 0.9243, eval_loss: 0.2995, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1122, expected_sparsity: 0.1086, expected_sequence_sparsity: 0.6544, target_sparsity: 0.114, step: 6000
lambda_1: 0.6548, lambda_2: 46.3613 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.98 0.73 0.49]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.35]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111011111110111100000
1011100001111000001001011
loss: 0.007393, lagrangian_loss: -0.001965, attention_score_distillation_loss: 0.002874
loss: 0.061582, lagrangian_loss: -0.001227, attention_score_distillation_loss: 0.003754
ETA: 3:53:00 | Epoch 2 finished. Took 376.91 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:08:26
Evaluating: accuracy: 0.9312, eval_loss: 0.3044, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.1211, expected_sparsity: 0.1179, expected_sequence_sparsity: 0.6581, target_sparsity: 0.1235, step: 6500
lambda_1: 0.3385, lambda_2: 50.3146 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.95 0.68 0.42]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.44]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.68, 0.3]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111101011111110111100000
1011110001101000001000011
loss: 0.014733, lagrangian_loss: 0.000137, attention_score_distillation_loss: 0.003701
loss: 0.155054, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.003613
----------------------------------------------------------------------
time: 2023-07-19 15:09:57
Evaluating: accuracy: 0.9278, eval_loss: 0.3334, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1556, expected_sparsity: 0.1486, expected_sequence_sparsity: 0.6701, target_sparsity: 0.133, step: 7000
lambda_1: 0.0394, lambda_2: 52.7942 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.89 0.67 0.38]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57, 0.23]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011111110100
1111101011111110111100000
1011100001101000001000011
loss: 0.179367, lagrangian_loss: 0.003221, attention_score_distillation_loss: 0.004066
loss: 0.031130, lagrangian_loss: 0.001172, attention_score_distillation_loss: 0.003428
----------------------------------------------------------------------
time: 2023-07-19 15:11:26
Evaluating: accuracy: 0.9335, eval_loss: 0.3093, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1611, expected_sparsity: 0.1546, expected_sequence_sparsity: 0.6724, target_sparsity: 0.1425, step: 7500
lambda_1: -0.5608, lambda_2: 56.8670 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.88 0.66 0.34]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.68, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.57, 0.18]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011111110100
1111101011111110111100000
1011100001101000000000010
loss: 0.009305, lagrangian_loss: -0.001350, attention_score_distillation_loss: 0.004126
loss: 0.004649, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.002670
----------------------------------------------------------------------
time: 2023-07-19 15:12:56
Evaluating: accuracy: 0.9323, eval_loss: 0.3185, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1733, expected_sparsity: 0.1649, expected_sequence_sparsity: 0.6765, target_sparsity: 0.152, step: 8000
lambda_1: 0.5161, lambda_2: 62.2754 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   1.   0.84 0.65 0.34]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.16]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110110100
1111100011111110111100000
1011100001101000000000010
loss: 0.172671, lagrangian_loss: -0.001068, attention_score_distillation_loss: 0.002684
loss: 0.014831, lagrangian_loss: 0.000837, attention_score_distillation_loss: 0.003171
ETA: 3:46:39 | Epoch 3 finished. Took 377.48 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:14:26
Evaluating: accuracy: 0.9289, eval_loss: 0.306, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.18, expected_sparsity: 0.1749, expected_sequence_sparsity: 0.6804, target_sparsity: 0.1615, step: 8500
lambda_1: -0.3289, lambda_2: 66.5334 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   0.99 0.79 0.62 0.31]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.6, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.46, 0.15]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110100100
1111100011111110111000000
1011100001101000000000010
loss: 0.024292, lagrangian_loss: 0.000257, attention_score_distillation_loss: 0.002809
loss: 0.003360, lagrangian_loss: 0.000544, attention_score_distillation_loss: 0.003390
----------------------------------------------------------------------
time: 2023-07-19 15:15:56
Evaluating: accuracy: 0.9323, eval_loss: 0.3061, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1922, expected_sparsity: 0.1833, expected_sequence_sparsity: 0.6837, target_sparsity: 0.171, step: 9000
lambda_1: -0.0414, lambda_2: 70.2247 lambda_3: 0.0000
train remain: [1.   1.   1.   1.   1.   1.   0.99 0.76 0.6  0.27]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.6, 0.28]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.43, 0.12]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110100000
1111100011111110111000000
1011100001101000000000000
loss: 0.018423, lagrangian_loss: 0.000380, attention_score_distillation_loss: 0.003018
loss: 0.006017, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.003399
----------------------------------------------------------------------
time: 2023-07-19 15:17:26
Evaluating: accuracy: 0.9266, eval_loss: 0.3326, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1955, expected_sparsity: 0.1887, expected_sequence_sparsity: 0.6858, target_sparsity: 0.1805, step: 9500
lambda_1: -0.3810, lambda_2: 74.8933 lambda_3: 0.0000
train remain: [1.   1.   1.   0.99 1.   0.99 1.   0.74 0.58 0.24]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.56, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.4, 0.1]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110100000
1111100011011110111000000
1001100001101000000000000
loss: 0.014049, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.002617
loss: 0.008354, lagrangian_loss: -0.003637, attention_score_distillation_loss: 0.003165
----------------------------------------------------------------------
time: 2023-07-19 15:18:56
Evaluating: accuracy: 0.9266, eval_loss: 0.3274, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1922, expected_sparsity: 0.1855, expected_sequence_sparsity: 0.6846, target_sparsity: 0.19, step: 10000
lambda_1: 0.2728, lambda_2: 84.7973 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.99 0.99 0.99 0.99 0.74 0.6  0.26]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.6, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.43, 0.1]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110100000
1111100011011110111000010
1001100000101010000000000
loss: 0.035517, lagrangian_loss: -0.000120, attention_score_distillation_loss: 0.003169
loss: 0.009681, lagrangian_loss: 0.000505, attention_score_distillation_loss: 0.002792
----------------------------------------------------------------------
time: 2023-07-19 15:20:26
Evaluating: accuracy: 0.93, eval_loss: 0.3142, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.2256, expected_sparsity: 0.216, expected_sequence_sparsity: 0.6965, target_sparsity: 0.1995, step: 10500
lambda_1: -0.3492, lambda_2: 97.0002 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.99 0.96 1.   0.99 0.73 0.57 0.24]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 0.72, 0.56, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.66, 0.37, 0.09]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111111111111111
1111111111111111111111111
1111111111111011110100000
1111100011011110111000000
1001100000101010000000000
loss: 0.080032, lagrangian_loss: 0.001015, attention_score_distillation_loss: 0.002486
ETA: 3:40:50 | Epoch 4 finished. Took 381.93 seconds.
loss: 0.021222, lagrangian_loss: 0.000672, attention_score_distillation_loss: 0.002489
----------------------------------------------------------------------
time: 2023-07-19 15:21:55
Evaluating: accuracy: 0.9255, eval_loss: 0.3376, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2144, expected_sparsity: 0.2068, expected_sequence_sparsity: 0.6929, target_sparsity: 0.209, step: 11000
lambda_1: -0.2302, lambda_2: 104.3252 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.99 0.97 1.   0.98 0.73 0.46 0.2 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.72, 0.44, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.69, 0.3, 0.06]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111110
1111111111111011110100000
1111100010010010111000000
1001100000101000000000000
loss: 0.221109, lagrangian_loss: 0.006300, attention_score_distillation_loss: 0.002147
loss: 0.023150, lagrangian_loss: 0.001558, attention_score_distillation_loss: 0.001901
----------------------------------------------------------------------
time: 2023-07-19 15:23:25
Evaluating: accuracy: 0.9255, eval_loss: 0.33, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2233, expected_sparsity: 0.2111, expected_sequence_sparsity: 0.6946, target_sparsity: 0.2185, step: 11500
lambda_1: -0.5328, lambda_2: 114.2968 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.98 0.97 1.   0.97 0.73 0.39 0.16]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.72, 0.4, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.69, 0.28, 0.04]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111110
1111111111111011110100000
1111100010010010110000000
1000100000100000000000001
loss: 0.009923, lagrangian_loss: -0.000323, attention_score_distillation_loss: 0.002611
loss: 0.070220, lagrangian_loss: 0.000519, attention_score_distillation_loss: 0.002231
----------------------------------------------------------------------
time: 2023-07-19 15:24:55
Evaluating: accuracy: 0.9243, eval_loss: 0.3453, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.2166, expected_sparsity: 0.2074, expected_sequence_sparsity: 0.6931, target_sparsity: 0.228, step: 12000
lambda_1: -0.5192, lambda_2: 127.7936 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.98 0.96 0.99 0.98 0.72 0.38 0.16]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.36, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72, 0.26, 0.04]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111011110100000
0111100010010010110000000
1000100000100000000000001
loss: 0.064717, lagrangian_loss: -0.000171, attention_score_distillation_loss: 0.002370
loss: 0.008864, lagrangian_loss: 0.001751, attention_score_distillation_loss: 0.002193
----------------------------------------------------------------------
time: 2023-07-19 15:26:25
Evaluating: accuracy: 0.9243, eval_loss: 0.3443, token_prune_loc: [False, False, False, False, True, False, True, True, True, True], macs_sparsity: 0.26, expected_sparsity: 0.249, expected_sequence_sparsity: 0.7095, target_sparsity: 0.2375, step: 12500
lambda_1: -0.3350, lambda_2: 138.8287 lambda_3: 0.0000
train remain: [0.99 0.99 0.98 0.98 0.96 0.99 0.96 0.69 0.35 0.15]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.92, 0.68, 0.36, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.85, 0.58, 0.21, 0.03]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111111111111111
1111111111111111111011110
1111111111011011110100000
0111100010010010110000000
1000100000100000000001000
loss: 0.232064, lagrangian_loss: -0.000202, attention_score_distillation_loss: 0.001972
ETA: 3:34:27 | Epoch 5 finished. Took 377.78 seconds.
loss: 0.008599, lagrangian_loss: -0.000286, attention_score_distillation_loss: 0.002237
----------------------------------------------------------------------
time: 2023-07-19 15:27:55
Evaluating: accuracy: 0.9209, eval_loss: 0.3468, token_prune_loc: [False, False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2667, expected_sparsity: 0.2559, expected_sequence_sparsity: 0.7121, target_sparsity: 0.247, step: 13000
lambda_1: -0.2239, lambda_2: 151.3447 lambda_3: 0.0000
train remain: [1.   1.   1.   0.97 0.95 0.97 0.91 0.67 0.35 0.13]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.88, 0.68, 0.36, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.81, 0.55, 0.2, 0.02]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111111111111111
1111111111111111111011100
1111111111011011110100000
0111100010000010110001000
1000100000000000001000000
loss: 0.051538, lagrangian_loss: 0.003540, attention_score_distillation_loss: 0.002079
loss: 0.009448, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.001888
----------------------------------------------------------------------
time: 2023-07-19 15:29:25
Evaluating: accuracy: 0.9323, eval_loss: 0.3031, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2834, expected_sparsity: 0.274, expected_sequence_sparsity: 0.7192, target_sparsity: 0.2565, step: 13500
lambda_1: 0.0737, lambda_2: 163.3103 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.99 0.97 0.92 0.88 0.66 0.34 0.13]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 0.68, 0.32, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.48, 0.15, 0.02]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110000
1111111111111110111011100
1111111111011011110100000
0111100010000010110000000
1000100000000000000000001
loss: 0.114277, lagrangian_loss: 0.001444, attention_score_distillation_loss: 0.001766
loss: 0.016524, lagrangian_loss: 0.001059, attention_score_distillation_loss: 0.001796
----------------------------------------------------------------------
time: 2023-07-19 15:30:55
Evaluating: accuracy: 0.9266, eval_loss: 0.3426, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3068, expected_sparsity: 0.2972, expected_sequence_sparsity: 0.7284, target_sparsity: 0.266, step: 14000
lambda_1: -0.3037, lambda_2: 174.0812 lambda_3: 0.0000
train remain: [1.   1.   1.   0.97 0.95 0.91 0.87 0.66 0.32 0.12]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.84, 0.84, 0.64, 0.32, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.77, 0.65, 0.42, 0.13, 0.02]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111111111110000
1111111111111110111011100
0111111111011011110100000
0111100010000010110000000
1000100000000000000000001
loss: 0.045118, lagrangian_loss: 0.002866, attention_score_distillation_loss: 0.001582
loss: 0.037236, lagrangian_loss: 0.008492, attention_score_distillation_loss: 0.001601
----------------------------------------------------------------------
time: 2023-07-19 15:32:25
Evaluating: accuracy: 0.9255, eval_loss: 0.3484, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.26, expected_sparsity: 0.2521, expected_sequence_sparsity: 0.7107, target_sparsity: 0.2755, step: 14500
lambda_1: -0.2446, lambda_2: 188.3178 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.94 0.97 0.93 0.82 0.65 0.28 0.13]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.28, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.51, 0.14, 0.02]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110111011000
0111111111011011110100000
0111100010000010100000000
1000101000000000000000000
loss: 0.045942, lagrangian_loss: 0.001020, attention_score_distillation_loss: 0.001363
ETA: 3:28:05 | Epoch 6 finished. Took 377.63 seconds.
loss: 0.311537, lagrangian_loss: 0.003741, attention_score_distillation_loss: 0.001749
----------------------------------------------------------------------
time: 2023-07-19 15:33:55
Evaluating: accuracy: 0.9209, eval_loss: 0.3631, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3034, expected_sparsity: 0.2916, expected_sequence_sparsity: 0.7261, target_sparsity: 0.285, step: 15000
lambda_1: -0.0262, lambda_2: 200.2944 lambda_3: 0.0000
train remain: [1.   1.   1.   0.94 0.97 0.88 0.84 0.66 0.28 0.13]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.64, 0.28, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.41, 0.11, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111100000
1111111111111110111011000
0111111111011011110100000
0110100010000010100000100
1000100000000000000000001
loss: 0.070817, lagrangian_loss: 0.010246, attention_score_distillation_loss: 0.001247
loss: 0.007067, lagrangian_loss: 0.000312, attention_score_distillation_loss: 0.001414
----------------------------------------------------------------------
time: 2023-07-19 15:35:25
Evaluating: accuracy: 0.9243, eval_loss: 0.3757, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3301, expected_sparsity: 0.3178, expected_sequence_sparsity: 0.7364, target_sparsity: 0.2945, step: 15500
lambda_1: -0.2126, lambda_2: 214.0570 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.95 0.95 0.83 0.82 0.65 0.28 0.12]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.8, 0.64, 0.28, 0.12]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.56, 0.36, 0.1, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111110111100000
1111111111111110111011000
0111111111011011110100000
1110100010000010100000000
1000100000100000000000000
loss: 0.050685, lagrangian_loss: 0.007757, attention_score_distillation_loss: 0.001605
loss: 0.030875, lagrangian_loss: -0.000290, attention_score_distillation_loss: 0.001264
----------------------------------------------------------------------
time: 2023-07-19 15:36:56
Evaluating: accuracy: 0.9278, eval_loss: 0.3263, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3301, expected_sparsity: 0.3184, expected_sequence_sparsity: 0.7366, target_sparsity: 0.304, step: 16000
lambda_1: -0.5446, lambda_2: 225.3073 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.94 0.93 0.81 0.82 0.64 0.28 0.09]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.76, 0.8, 0.64, 0.28, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.7, 0.56, 0.36, 0.1, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111110110
1111111111111110111100000
1111111111111110111011000
0111111111011011110100000
1110100010000010100000000
1000100000000000000000000
loss: 0.040715, lagrangian_loss: 0.002310, attention_score_distillation_loss: 0.001337
loss: 0.041333, lagrangian_loss: 0.011079, attention_score_distillation_loss: 0.001058
----------------------------------------------------------------------
time: 2023-07-19 15:38:25
Evaluating: accuracy: 0.922, eval_loss: 0.3797, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.3101, expected_sparsity: 0.3014, expected_sequence_sparsity: 0.73, target_sparsity: 0.3135, step: 16500
lambda_1: -0.2081, lambda_2: 237.4823 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.92 0.94 0.78 0.82 0.62 0.26 0.09]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.8, 0.64, 0.24, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.61, 0.39, 0.09, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110111100000
1111111111111110111011000
0111111111011011110100000
0110100010000010100000000
1000000000001000000000000
loss: 0.013925, lagrangian_loss: -0.000036, attention_score_distillation_loss: 0.001068
loss: 0.062891, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.001046
ETA: 3:21:45 | Epoch 7 finished. Took 377.88 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:39:55
Evaluating: accuracy: 0.9174, eval_loss: 0.3769, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3902, expected_sparsity: 0.3752, expected_sequence_sparsity: 0.7589, target_sparsity: 0.323, step: 17000
lambda_1: -0.3345, lambda_2: 250.9405 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.91 0.93 0.77 0.82 0.62 0.24 0.09]
infer remain: [1.0, 1.0, 1.0, 0.84, 0.88, 0.76, 0.8, 0.6, 0.24, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74, 0.56, 0.45, 0.27, 0.06, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101111100
1111111111111111111110100
1111111111111110111100000
1111111111111110111011000
0111111111011010110100000
0110100010000010100000000
1000000010000000000000000
loss: 0.045489, lagrangian_loss: 0.000514, attention_score_distillation_loss: 0.001010
loss: 0.020274, lagrangian_loss: 0.002810, attention_score_distillation_loss: 0.000836
----------------------------------------------------------------------
time: 2023-07-19 15:41:25
Evaluating: accuracy: 0.9186, eval_loss: 0.3662, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3902, expected_sparsity: 0.3752, expected_sequence_sparsity: 0.7589, target_sparsity: 0.3325, step: 17500
lambda_1: -0.3834, lambda_2: 262.7237 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.88 0.91 0.77 0.83 0.63 0.25 0.09]
infer remain: [1.0, 1.0, 1.0, 0.84, 0.88, 0.76, 0.8, 0.6, 0.24, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.74, 0.56, 0.45, 0.27, 0.06, 0.01]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111111101101100
1111111111111111111110100
1111111111111110111100000
1111111111111110111011000
0111111111011010110100000
1110100010000010000000000
1000000010000000000000000
loss: 0.008387, lagrangian_loss: 0.000690, attention_score_distillation_loss: 0.000897
loss: 0.025521, lagrangian_loss: 0.001207, attention_score_distillation_loss: 0.000878
----------------------------------------------------------------------
time: 2023-07-19 15:42:55
Evaluating: accuracy: 0.9163, eval_loss: 0.4031, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4036, expected_sparsity: 0.3935, expected_sequence_sparsity: 0.7661, target_sparsity: 0.342, step: 18000
lambda_1: -0.2471, lambda_2: 276.1984 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.88 0.87 0.77 0.82 0.62 0.24 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.76, 0.8, 0.6, 0.24, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.51, 0.41, 0.25, 0.06, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111111111110000
1111111111111110101110000
1111111111111110111011000
0111111111011010110100000
1110100010000010000000000
1000000000001000000000000
loss: 0.013237, lagrangian_loss: 0.013013, attention_score_distillation_loss: 0.000661
loss: 0.703443, lagrangian_loss: 0.062650, attention_score_distillation_loss: 0.000531
----------------------------------------------------------------------
time: 2023-07-19 15:44:24
Evaluating: accuracy: 0.9186, eval_loss: 0.3741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4102, expected_sparsity: 0.3995, expected_sequence_sparsity: 0.7684, target_sparsity: 0.3515, step: 18500
lambda_1: -0.7234, lambda_2: 288.0484 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.85 0.87 0.77 0.82 0.62 0.22 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.72, 0.8, 0.6, 0.2, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.48, 0.39, 0.23, 0.05, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111111111110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0110100010000010000000000
1000000000000000000000001
loss: 0.010800, lagrangian_loss: 0.000563, attention_score_distillation_loss: 0.000612
loss: 0.077505, lagrangian_loss: 0.009510, attention_score_distillation_loss: 0.000497
ETA: 3:15:22 | Epoch 8 finished. Took 377.04 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:45:54
Evaluating: accuracy: 0.914, eval_loss: 0.4015, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4102, expected_sparsity: 0.3995, expected_sequence_sparsity: 0.7684, target_sparsity: 0.361, step: 19000
lambda_1: -0.7331, lambda_2: 300.5385 lambda_3: 0.0000
train remain: [0.99 1.   1.   0.83 0.86 0.74 0.82 0.62 0.22 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.84, 0.72, 0.8, 0.6, 0.2, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.48, 0.39, 0.23, 0.05, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111111111110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0110100010000000001000000
1000000000000000000000001
loss: 0.007223, lagrangian_loss: 0.016090, attention_score_distillation_loss: 0.000555
loss: 0.011439, lagrangian_loss: 0.000215, attention_score_distillation_loss: 0.000448
Starting saving the best from epoch 9 and step 19500
----------------------------------------------------------------------
time: 2023-07-19 15:47:24
Evaluating: accuracy: 0.9117, eval_loss: 0.3985, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4202, expected_sparsity: 0.4072, expected_sequence_sparsity: 0.7714, target_sparsity: 0.3705, step: 19500
lambda_1: -0.6042, lambda_2: 313.5609 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.82 0.82 0.74 0.81 0.62 0.17 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.8, 0.72, 0.8, 0.6, 0.16, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.46, 0.37, 0.22, 0.04, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110111110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0010100010000000001000000
1000000000001000000000000
Saving the best model so far: [Epoch 9 | Step: 19500 | MACs sparsity: 0.4202 | Score: 0.9117 | Loss: 0.3985]
loss: 0.021915, lagrangian_loss: 0.071607, attention_score_distillation_loss: 0.000287
loss: 0.019239, lagrangian_loss: 0.002265, attention_score_distillation_loss: 0.000318
----------------------------------------------------------------------
time: 2023-07-19 15:49:51
Evaluating: accuracy: 0.9128, eval_loss: 0.396, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.414, expected_sequence_sparsity: 0.7741, target_sparsity: 0.38, step: 20000
lambda_1: -0.7261, lambda_2: 326.3415 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.82 0.78 0.74 0.81 0.62 0.17 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.16, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0010100010000000001000000
1000000000000000000000001
Best eval score so far: 0.9117 @ step 19500 epoch 9.26
Saving the best model so far: [Epoch 9 | Step: 20000 | MACs sparsity: 0.4269 | Score: 0.9128 | Loss: 0.396]
loss: 0.008656, lagrangian_loss: 0.004315, attention_score_distillation_loss: 0.000304
loss: 0.145286, lagrangian_loss: 0.003923, attention_score_distillation_loss: 0.000208
----------------------------------------------------------------------
time: 2023-07-19 15:51:52
Evaluating: accuracy: 0.9071, eval_loss: 0.4292, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.414, expected_sequence_sparsity: 0.7741, target_sparsity: 0.3895, step: 20500
lambda_1: -1.2951, lambda_2: 338.3936 lambda_3: 0.0000
train remain: [0.99 0.98 1.   0.81 0.77 0.73 0.81 0.61 0.16 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.16, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0010100010000000001000000
1000000000000000000000001
Best eval score so far: 0.9128 @ step 20000 epoch 9.50
loss: 0.225192, lagrangian_loss: -0.000559, attention_score_distillation_loss: 0.000140
loss: 0.015227, lagrangian_loss: 0.002478, attention_score_distillation_loss: 0.000064
----------------------------------------------------------------------
time: 2023-07-19 15:53:21
Evaluating: accuracy: 0.9197, eval_loss: 0.3732, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4269, expected_sparsity: 0.4148, expected_sequence_sparsity: 0.7744, target_sparsity: 0.399, step: 21000
lambda_1: -2.0383, lambda_2: 350.4059 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.81 0.77 0.73 0.8  0.62 0.13 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.8, 0.6, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.35, 0.21, 0.03, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101110000
1111111111111110101100000
1111111111111110111011000
0111111111011010110100000
0010100010000000000000000
1000000000000000000000001
Best eval score so far: 0.9128 @ step 20000 epoch 9.50
Saving the best model so far: [Epoch 9 | Step: 21000 | MACs sparsity: 0.4269 | Score: 0.9197 | Loss: 0.3732]
loss: 0.007518, lagrangian_loss: 0.061964, attention_score_distillation_loss: 0.000040
ETA: 3:15:03 | Epoch 9 finished. Took 497.94 seconds.
loss: 0.339028, lagrangian_loss: 0.008784, attention_score_distillation_loss: 0.000060
----------------------------------------------------------------------
time: 2023-07-19 15:55:18
Evaluating: accuracy: 0.9071, eval_loss: 0.4127, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4335, expected_sparsity: 0.4171, expected_sequence_sparsity: 0.7753, target_sparsity: 0.4, step: 21500
lambda_1: -0.6656, lambda_2: 363.0825 lambda_3: 0.0000
train remain: [0.99 0.96 0.99 0.81 0.76 0.73 0.77 0.62 0.13 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.72, 0.76, 0.6, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.44, 0.33, 0.2, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101110000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0010100001000000000000000
1000000010000000000000000
Best eval score so far: 0.9197 @ step 21000 epoch 9.98
loss: 0.070387, lagrangian_loss: 0.002247, attention_score_distillation_loss: 0.000052
loss: 0.008773, lagrangian_loss: 0.002919, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 15:56:48
Evaluating: accuracy: 0.9174, eval_loss: 0.4149, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4226, expected_sequence_sparsity: 0.7775, target_sparsity: 0.4, step: 22000
lambda_1: -0.4025, lambda_2: 374.1867 lambda_3: 0.0000
train remain: [0.98 0.97 1.   0.81 0.74 0.73 0.77 0.62 0.13 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.64, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.2, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110110000
0010101000000000000000000
1000000000001000000000000
Best eval score so far: 0.9197 @ step 21000 epoch 9.98
loss: 0.009230, lagrangian_loss: 0.012265, attention_score_distillation_loss: 0.000046
loss: 0.021243, lagrangian_loss: 0.014777, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 15:58:18
Evaluating: accuracy: 0.9209, eval_loss: 0.387, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4238, expected_sequence_sparsity: 0.778, target_sparsity: 0.4, step: 22500
lambda_1: -0.4009, lambda_2: 385.9117 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.82 0.74 0.74 0.77 0.62 0.13 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0010101000000000000000000
1000000000000000000000010
Best eval score so far: 0.9197 @ step 21000 epoch 9.98
Saving the best model so far: [Epoch 10 | Step: 22500 | MACs sparsity: 0.4402 | Score: 0.9209 | Loss: 0.387]
loss: 0.007688, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000051
loss: 0.334175, lagrangian_loss: 0.024917, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 16:00:22
Evaluating: accuracy: 0.9186, eval_loss: 0.3972, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4238, expected_sequence_sparsity: 0.778, target_sparsity: 0.4, step: 23000
lambda_1: -0.4820, lambda_2: 396.7532 lambda_3: 0.0000
train remain: [0.97 0.99 1.   0.82 0.74 0.74 0.77 0.61 0.13 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0010101000000000000000000
1000000000000000100000000
Best eval score so far: 0.9209 @ step 22500 epoch 10.69
loss: 0.019337, lagrangian_loss: 0.029455, attention_score_distillation_loss: 0.000044
ETA: 3:09:27 | Epoch 10 finished. Took 410.65 seconds.
loss: 0.028104, lagrangian_loss: -0.004824, attention_score_distillation_loss: 0.000049
----------------------------------------------------------------------
time: 2023-07-19 16:01:52
Evaluating: accuracy: 0.922, eval_loss: 0.3689, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 23500
lambda_1: -0.5830, lambda_2: 409.2549 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.81 0.74 0.73 0.77 0.6  0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0010100000000000000000000
1000000000001000000000000
Best eval score so far: 0.9209 @ step 22500 epoch 10.69
Saving the best model so far: [Epoch 11 | Step: 23500 | MACs sparsity: 0.4402 | Score: 0.922 | Loss: 0.3689]
loss: 0.009476, lagrangian_loss: 0.016693, attention_score_distillation_loss: 0.000052
loss: 0.005693, lagrangian_loss: 0.006703, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 16:03:57
Evaluating: accuracy: 0.9151, eval_loss: 0.3836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 24000
lambda_1: -0.2180, lambda_2: 420.8145 lambda_3: 0.0000
train remain: [0.98 0.99 0.98 0.81 0.74 0.74 0.77 0.61 0.1  0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0000100000000000001000000
1000000000001000000000000
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
loss: 0.062661, lagrangian_loss: 0.017847, attention_score_distillation_loss: 0.000057
loss: 0.241612, lagrangian_loss: 0.015180, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 16:05:27
Evaluating: accuracy: 0.9186, eval_loss: 0.3978, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 24500
lambda_1: -1.9868, lambda_2: 432.3417 lambda_3: 0.0000
train remain: [0.99 0.99 0.97 0.82 0.73 0.73 0.77 0.58 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0000100000000000001000000
1000000100000000000000000
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
loss: 0.012182, lagrangian_loss: 0.006213, attention_score_distillation_loss: 0.000048
loss: 0.014083, lagrangian_loss: 0.008984, attention_score_distillation_loss: 0.000051
----------------------------------------------------------------------
time: 2023-07-19 16:06:57
Evaluating: accuracy: 0.9163, eval_loss: 0.4165, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 25000
lambda_1: -0.0343, lambda_2: 443.8658 lambda_3: 0.0000
train remain: [0.99 0.99 0.98 0.82 0.74 0.74 0.77 0.61 0.1  0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0000100000000000000000100
1000000000000000000100000
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
loss: 0.039797, lagrangian_loss: 0.001821, attention_score_distillation_loss: 0.000053
loss: 0.005244, lagrangian_loss: 0.019639, attention_score_distillation_loss: 0.000046
ETA: 3:03:43 | Epoch 11 finished. Took 412.35 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:08:27
Evaluating: accuracy: 0.9197, eval_loss: 0.3906, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4244, expected_sequence_sparsity: 0.7782, target_sparsity: 0.4, step: 25500
lambda_1: -0.2721, lambda_2: 455.2599 lambda_3: 0.0000
train remain: [0.98 0.98 0.99 0.82 0.74 0.76 0.77 0.6  0.1  0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.6, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.19, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0111111111011010110100000
0000100000000010000000000
1000000010000000000000000
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
loss: 0.006331, lagrangian_loss: 0.031499, attention_score_distillation_loss: 0.000042
loss: 0.245801, lagrangian_loss: 0.017978, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 16:09:56
Evaluating: accuracy: 0.9186, eval_loss: 0.4125, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 26000
lambda_1: -0.3610, lambda_2: 466.5440 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.81 0.76 0.74 0.77 0.58 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0101111111011010110100000
0000100000001000000000000
1000000000000000000000001
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
loss: 0.020716, lagrangian_loss: 0.012093, attention_score_distillation_loss: 0.000058
loss: 0.037960, lagrangian_loss: 0.028970, attention_score_distillation_loss: 0.000062
----------------------------------------------------------------------
time: 2023-07-19 16:11:26
Evaluating: accuracy: 0.9243, eval_loss: 0.365, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 26500
lambda_1: -0.6528, lambda_2: 477.6496 lambda_3: 0.0000
train remain: [0.99 0.96 0.99 0.81 0.75 0.74 0.76 0.58 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0101111111011010110100000
0000101000000000000000000
1000000000000000000000001
Best eval score so far: 0.9220 @ step 23500 epoch 11.16
Saving the best model so far: [Epoch 12 | Step: 26500 | MACs sparsity: 0.4402 | Score: 0.9243 | Loss: 0.365]
loss: 0.028442, lagrangian_loss: 0.006984, attention_score_distillation_loss: 0.000051
loss: 0.005528, lagrangian_loss: 0.036418, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 16:13:24
Evaluating: accuracy: 0.9197, eval_loss: 0.3721, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4249, expected_sequence_sparsity: 0.7784, target_sparsity: 0.4, step: 27000
lambda_1: -0.2567, lambda_2: 488.4018 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.81 0.76 0.74 0.75 0.58 0.11 0.1 ]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.12, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.02, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0101111111011010110100000
0000101000001000000000000
1000000100000000000000000
Best eval score so far: 0.9243 @ step 26500 epoch 12.59
loss: 0.005848, lagrangian_loss: 0.004468, attention_score_distillation_loss: 0.000057
loss: 0.008393, lagrangian_loss: 0.000322, attention_score_distillation_loss: 0.000051
ETA: 2:57:33 | Epoch 12 finished. Took 405.04 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:14:54
Evaluating: accuracy: 0.922, eval_loss: 0.3892, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4, step: 27500
lambda_1: -1.2530, lambda_2: 500.5627 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.81 0.76 0.72 0.75 0.57 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.76, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.32, 0.18, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111110111011000
0101111111011010110100000
0000100000001000000000000
1000000000000000000000001
Best eval score so far: 0.9243 @ step 26500 epoch 12.59
loss: 0.002942, lagrangian_loss: 0.000287, attention_score_distillation_loss: 0.000047
loss: 0.022960, lagrangian_loss: 0.082235, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 16:16:23
Evaluating: accuracy: 0.9255, eval_loss: 0.382, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 28000
lambda_1: -0.5438, lambda_2: 511.3359 lambda_3: 0.0000
train remain: [0.99 0.98 1.   0.81 0.76 0.73 0.73 0.57 0.1  0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100000001000000000000
1000000000000000000000001
Best eval score so far: 0.9243 @ step 26500 epoch 12.59
Saving the best model so far: [Epoch 13 | Step: 28000 | MACs sparsity: 0.4402 | Score: 0.9255 | Loss: 0.382]
loss: 0.001618, lagrangian_loss: 0.024870, attention_score_distillation_loss: 0.000058
loss: 0.020451, lagrangian_loss: 0.000290, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 16:18:26
Evaluating: accuracy: 0.9289, eval_loss: 0.3622, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 28500
lambda_1: -0.1599, lambda_2: 521.5013 lambda_3: 0.0000
train remain: [0.99 0.97 1.   0.81 0.76 0.73 0.73 0.57 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100000001000000000000
1000000010000000000000000
Best eval score so far: 0.9255 @ step 28000 epoch 13.30
Saving the best model so far: [Epoch 13 | Step: 28500 | MACs sparsity: 0.4402 | Score: 0.9289 | Loss: 0.3622]
loss: 0.004615, lagrangian_loss: 0.019029, attention_score_distillation_loss: 0.000057
loss: 0.012880, lagrangian_loss: 0.015729, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 16:20:24
Evaluating: accuracy: 0.922, eval_loss: 0.3981, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 29000
lambda_1: -0.4209, lambda_2: 532.8297 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.82 0.75 0.74 0.74 0.56 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100000000000001000000
1000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.039566, lagrangian_loss: 0.010481, attention_score_distillation_loss: 0.000053
loss: 0.004200, lagrangian_loss: 0.004085, attention_score_distillation_loss: 0.000056
ETA: 2:52:18 | Epoch 13 finished. Took 437.47 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:21:53
Evaluating: accuracy: 0.9186, eval_loss: 0.4071, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 29500
lambda_1: -0.8445, lambda_2: 544.2328 lambda_3: 0.0000
train remain: [0.99 0.98 0.99 0.81 0.74 0.74 0.74 0.55 0.09 0.09]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100000001000000000000
1000000010000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.065120, lagrangian_loss: 0.002955, attention_score_distillation_loss: 0.000054
loss: 0.006705, lagrangian_loss: 0.060361, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 16:23:23
Evaluating: accuracy: 0.9232, eval_loss: 0.3768, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 30000
lambda_1: -0.4613, lambda_2: 555.9514 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.83 0.74 0.74 0.73 0.54 0.09 0.08]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100000000000001000000
1000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.004798, lagrangian_loss: 0.016233, attention_score_distillation_loss: 0.000042
loss: 0.017283, lagrangian_loss: 0.006262, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 16:24:52
Evaluating: accuracy: 0.9278, eval_loss: 0.3585, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 30500
lambda_1: -0.5198, lambda_2: 567.5977 lambda_3: 0.0000
train remain: [0.99 0.97 0.99 0.82 0.75 0.74 0.74 0.55 0.09 0.08]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000101000000000000000000
1000000010000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.010632, lagrangian_loss: 0.006477, attention_score_distillation_loss: 0.000051
loss: 0.014309, lagrangian_loss: 0.002179, attention_score_distillation_loss: 0.000059
----------------------------------------------------------------------
time: 2023-07-19 16:26:22
Evaluating: accuracy: 0.9243, eval_loss: 0.3599, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 31000
lambda_1: -0.4821, lambda_2: 579.8088 lambda_3: 0.0000
train remain: [0.99 0.98 0.99 0.82 0.74 0.74 0.74 0.54 0.1  0.08]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101111111011010110100000
0000100010000000000000000
1000000000000000010000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.015730, lagrangian_loss: 0.107058, attention_score_distillation_loss: 0.000042
loss: 0.002579, lagrangian_loss: 0.005024, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 16:27:52
Evaluating: accuracy: 0.9186, eval_loss: 0.3836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 31500
lambda_1: -1.0087, lambda_2: 592.1707 lambda_3: 0.0000
train remain: [0.99 0.98 0.99 0.82 0.74 0.73 0.73 0.54 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101101111011010110100000
0000100010000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005703, lagrangian_loss: 0.003289, attention_score_distillation_loss: 0.000048
ETA: 2:45:13 | Epoch 14 finished. Took 381.35 seconds.
loss: 0.007816, lagrangian_loss: 0.074306, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 16:29:22
Evaluating: accuracy: 0.9128, eval_loss: 0.4007, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 32000
lambda_1: -0.5498, lambda_2: 603.1637 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.81 0.74 0.73 0.73 0.54 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101101111011010110100000
0000100000000010000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007923, lagrangian_loss: 0.008814, attention_score_distillation_loss: 0.000053
loss: 0.019591, lagrangian_loss: 0.008428, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 16:30:52
Evaluating: accuracy: 0.9151, eval_loss: 0.3968, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4402, expected_sparsity: 0.4276, expected_sequence_sparsity: 0.7795, target_sparsity: 0.4, step: 32500
lambda_1: -0.2055, lambda_2: 616.0857 lambda_3: 0.0000
train remain: [0.98 0.99 1.   0.82 0.74 0.73 0.73 0.54 0.1  0.07]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.17, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101101111011010110100010
0000100100000000000000000
1000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.015917, lagrangian_loss: 0.148786, attention_score_distillation_loss: 0.000040
loss: 0.012037, lagrangian_loss: -0.000045, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 16:32:22
Evaluating: accuracy: 0.9232, eval_loss: 0.3708, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4287, expected_sequence_sparsity: 0.7799, target_sparsity: 0.4, step: 33000
lambda_1: -0.6351, lambda_2: 627.1347 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.81 0.74 0.73 0.72 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.72, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.3, 0.16, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111011000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009967, lagrangian_loss: 0.004021, attention_score_distillation_loss: 0.000046
loss: 0.010388, lagrangian_loss: 0.000199, attention_score_distillation_loss: 0.000049
----------------------------------------------------------------------
time: 2023-07-19 16:33:51
Evaluating: accuracy: 0.9186, eval_loss: 0.3989, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4308, expected_sequence_sparsity: 0.7807, target_sparsity: 0.4, step: 33500
lambda_1: -0.8868, lambda_2: 638.0287 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.81 0.74 0.71 0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.28, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111010000
0101101111011010110100000
0010100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.018070, lagrangian_loss: 0.002462, attention_score_distillation_loss: 0.000051
ETA: 2:38:07 | Epoch 15 finished. Took 376.89 seconds.
loss: 0.111439, lagrangian_loss: 0.000197, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 16:35:21
Evaluating: accuracy: 0.9186, eval_loss: 0.406, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 34000
lambda_1: -0.6912, lambda_2: 648.1470 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.81 0.74 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.008511, lagrangian_loss: 0.021982, attention_score_distillation_loss: 0.000044
loss: 0.251608, lagrangian_loss: 0.000294, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 16:36:51
Evaluating: accuracy: 0.922, eval_loss: 0.3852, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4308, expected_sequence_sparsity: 0.7807, target_sparsity: 0.4, step: 34500
lambda_1: -0.3695, lambda_2: 659.3114 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.82 0.75 0.71 0.68 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.72, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.41, 0.28, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111111110101100000
1111101111111010111010000
0101101111011010110100000
0000101000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007828, lagrangian_loss: 0.000249, attention_score_distillation_loss: 0.000050
loss: 0.008113, lagrangian_loss: 0.021393, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 16:38:20
Evaluating: accuracy: 0.9232, eval_loss: 0.3906, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 35000
lambda_1: -0.4315, lambda_2: 671.6459 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.82 0.74 0.7  0.68 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000000000001
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.041162, lagrangian_loss: -0.000061, attention_score_distillation_loss: 0.000049
loss: 0.007525, lagrangian_loss: 0.020272, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 16:39:51
Evaluating: accuracy: 0.9232, eval_loss: 0.3764, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 35500
lambda_1: -0.1752, lambda_2: 681.8187 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.82 0.75 0.7  0.7  0.54 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000100000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005210, lagrangian_loss: 0.039003, attention_score_distillation_loss: 0.000046
loss: 0.004421, lagrangian_loss: 0.109581, attention_score_distillation_loss: 0.000043
ETA: 2:31:08 | Epoch 16 finished. Took 377.56 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:41:21
Evaluating: accuracy: 0.9163, eval_loss: 0.3986, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 36000
lambda_1: -0.4046, lambda_2: 694.4706 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.82 0.75 0.7  0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100001000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.002975, lagrangian_loss: 0.000761, attention_score_distillation_loss: 0.000053
loss: 0.004696, lagrangian_loss: 0.020834, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 16:42:50
Evaluating: accuracy: 0.9197, eval_loss: 0.3901, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 36500
lambda_1: -0.2002, lambda_2: 705.3948 lambda_3: 0.0000
train remain: [1.   0.99 0.99 0.81 0.75 0.71 0.69 0.53 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.107541, lagrangian_loss: 0.002262, attention_score_distillation_loss: 0.000057
loss: 0.057680, lagrangian_loss: 0.000653, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 16:44:20
Evaluating: accuracy: 0.9197, eval_loss: 0.3788, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 37000
lambda_1: -0.2664, lambda_2: 717.4370 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.74 0.71 0.69 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007292, lagrangian_loss: 0.048426, attention_score_distillation_loss: 0.000054
loss: 0.012846, lagrangian_loss: 0.000129, attention_score_distillation_loss: 0.000050
----------------------------------------------------------------------
time: 2023-07-19 16:45:50
Evaluating: accuracy: 0.9243, eval_loss: 0.3744, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 37500
lambda_1: -0.3495, lambda_2: 729.8696 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.75 0.7  0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100001000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.003568, lagrangian_loss: 0.027161, attention_score_distillation_loss: 0.000041
loss: 0.006349, lagrangian_loss: 0.010191, attention_score_distillation_loss: 0.000048
ETA: 2:24:13 | Epoch 17 finished. Took 377.46 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:47:20
Evaluating: accuracy: 0.9186, eval_loss: 0.3869, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 38000
lambda_1: -0.2518, lambda_2: 740.6155 lambda_3: 0.0000
train remain: [1.   0.99 0.99 0.8  0.75 0.7  0.68 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.006214, lagrangian_loss: 0.073724, attention_score_distillation_loss: 0.000041
loss: 0.010631, lagrangian_loss: 0.011686, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 16:48:50
Evaluating: accuracy: 0.9197, eval_loss: 0.3929, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 38500
lambda_1: -0.0965, lambda_2: 751.6179 lambda_3: 0.0000
train remain: [1.   0.99 0.99 0.81 0.77 0.7  0.69 0.54 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009295, lagrangian_loss: 0.029498, attention_score_distillation_loss: 0.000062
loss: 0.015984, lagrangian_loss: 0.034771, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 16:50:20
Evaluating: accuracy: 0.922, eval_loss: 0.4008, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4435, expected_sparsity: 0.4285, expected_sequence_sparsity: 0.7798, target_sparsity: 0.4, step: 39000
lambda_1: -0.1759, lambda_2: 763.9257 lambda_3: 0.0000
train remain: [1.   0.99 0.99 0.81 0.77 0.7  0.7  0.54 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.61, 0.41, 0.28, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111111101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000101000000000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.238435, lagrangian_loss: 0.000540, attention_score_distillation_loss: 0.000055
loss: 0.009482, lagrangian_loss: 0.359404, attention_score_distillation_loss: 0.000033
----------------------------------------------------------------------
time: 2023-07-19 16:51:49
Evaluating: accuracy: 0.9197, eval_loss: 0.3931, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4469, expected_sparsity: 0.4345, expected_sequence_sparsity: 0.7822, target_sparsity: 0.4, step: 39500
lambda_1: -0.1383, lambda_2: 775.9684 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.77 0.7  0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.8, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101100
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.006326, lagrangian_loss: 0.005719, attention_score_distillation_loss: 0.000049
loss: 0.005775, lagrangian_loss: 0.011350, attention_score_distillation_loss: 0.000054
ETA: 2:17:22 | Epoch 18 finished. Took 377.73 seconds.
----------------------------------------------------------------------
time: 2023-07-19 16:53:20
Evaluating: accuracy: 0.9209, eval_loss: 0.3899, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 40000
lambda_1: -0.3434, lambda_2: 787.0940 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.78 0.79 0.7  0.69 0.53 0.1  0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101110000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.011044, lagrangian_loss: 0.042601, attention_score_distillation_loss: 0.000048
loss: 0.014113, lagrangian_loss: 0.003574, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 16:54:50
Evaluating: accuracy: 0.9174, eval_loss: 0.3861, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 40500
lambda_1: -0.2989, lambda_2: 798.2635 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.79 0.78 0.7  0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101110000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.004105, lagrangian_loss: 0.060908, attention_score_distillation_loss: 0.000042
loss: 0.007662, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000049
----------------------------------------------------------------------
time: 2023-07-19 16:56:19
Evaluating: accuracy: 0.9255, eval_loss: 0.3706, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 41000
lambda_1: -0.5233, lambda_2: 809.8208 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.78 0.78 0.7  0.69 0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101110000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005376, lagrangian_loss: 0.003592, attention_score_distillation_loss: 0.000048
loss: 0.004263, lagrangian_loss: 0.010528, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 16:57:49
Evaluating: accuracy: 0.9209, eval_loss: 0.3861, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4502, expected_sparsity: 0.4374, expected_sequence_sparsity: 0.7833, target_sparsity: 0.4, step: 41500
lambda_1: -0.2841, lambda_2: 821.4862 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.78 0.82 0.7  0.69 0.53 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.76, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.58, 0.39, 0.27, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101110000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000010000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.017162, lagrangian_loss: 0.084159, attention_score_distillation_loss: 0.000045
loss: 0.009848, lagrangian_loss: 0.042307, attention_score_distillation_loss: 0.000058
----------------------------------------------------------------------
time: 2023-07-19 16:59:19
Evaluating: accuracy: 0.9266, eval_loss: 0.3469, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 42000
lambda_1: -0.2932, lambda_2: 831.1076 lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.78 0.81 0.7  0.68 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.008410, lagrangian_loss: 0.160790, attention_score_distillation_loss: 0.000042
ETA: 2:10:39 | Epoch 19 finished. Took 381.39 seconds.
loss: 0.004791, lagrangian_loss: 0.001433, attention_score_distillation_loss: 0.000046
----------------------------------------------------------------------
time: 2023-07-19 17:00:48
Evaluating: accuracy: 0.9255, eval_loss: 0.3558, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 42500
lambda_1: -0.3749, lambda_2: 842.8365 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.78 0.8  0.7  0.68 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009855, lagrangian_loss: 0.007496, attention_score_distillation_loss: 0.000052
loss: 0.006313, lagrangian_loss: 0.001644, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 17:02:18
Evaluating: accuracy: 0.9243, eval_loss: 0.3661, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 43000
lambda_1: -0.2701, lambda_2: 854.6217 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.79 0.79 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100100000000000000000
0001000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005993, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000053
loss: 0.006569, lagrangian_loss: 0.026503, attention_score_distillation_loss: 0.000058
----------------------------------------------------------------------
time: 2023-07-19 17:03:48
Evaluating: accuracy: 0.9232, eval_loss: 0.367, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 43500
lambda_1: -0.2675, lambda_2: 866.8499 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.79 0.78 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100100000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.008160, lagrangian_loss: 0.026365, attention_score_distillation_loss: 0.000056
loss: 0.008577, lagrangian_loss: 0.075638, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 17:05:18
Evaluating: accuracy: 0.9243, eval_loss: 0.376, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 44000
lambda_1: -0.4296, lambda_2: 878.5640 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.81 0.76 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100100000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007913, lagrangian_loss: 0.072113, attention_score_distillation_loss: 0.000063
ETA: 2:03:53 | Epoch 20 finished. Took 376.69 seconds.
loss: 0.004072, lagrangian_loss: 0.004772, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 17:06:47
Evaluating: accuracy: 0.9243, eval_loss: 0.3733, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 44500
lambda_1: -0.4339, lambda_2: 888.7227 lambda_3: 0.0000
train remain: [0.99 0.99 1.   0.8  0.76 0.69 0.69 0.52 0.08 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100100000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.127613, lagrangian_loss: 0.167968, attention_score_distillation_loss: 0.000063
loss: 0.010145, lagrangian_loss: 0.010796, attention_score_distillation_loss: 0.000060
----------------------------------------------------------------------
time: 2023-07-19 17:08:17
Evaluating: accuracy: 0.9278, eval_loss: 0.3555, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 45000
lambda_1: -0.3059, lambda_2: 900.1195 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.77 0.69 0.69 0.52 0.08 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.008137, lagrangian_loss: 0.080227, attention_score_distillation_loss: 0.000060
loss: 0.008451, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000050
----------------------------------------------------------------------
time: 2023-07-19 17:09:47
Evaluating: accuracy: 0.9278, eval_loss: 0.3421, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 45500
lambda_1: -0.3026, lambda_2: 912.6192 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.76 0.69 0.69 0.52 0.08 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000010000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.003658, lagrangian_loss: 0.010049, attention_score_distillation_loss: 0.000056
loss: 0.005571, lagrangian_loss: 0.008614, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 17:11:16
Evaluating: accuracy: 0.9266, eval_loss: 0.3548, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 46000
lambda_1: -0.4178, lambda_2: 923.5681 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.8  0.77 0.69 0.69 0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005839, lagrangian_loss: 0.023733, attention_score_distillation_loss: 0.000045
loss: 0.006106, lagrangian_loss: 0.036313, attention_score_distillation_loss: 0.000045
ETA: 1:57:09 | Epoch 21 finished. Took 376.18 seconds.
----------------------------------------------------------------------
time: 2023-07-19 17:12:46
Evaluating: accuracy: 0.9255, eval_loss: 0.3507, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 46500
lambda_1: -0.1736, lambda_2: 936.0222 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.78 0.7  0.7  0.52 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100001000000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009574, lagrangian_loss: 0.010538, attention_score_distillation_loss: 0.000060
loss: 0.005597, lagrangian_loss: 0.038347, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 17:14:15
Evaluating: accuracy: 0.9289, eval_loss: 0.3452, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 47000
lambda_1: -0.0433, lambda_2: 948.1556 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.78 0.7  0.7  0.52 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000001000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.024184, lagrangian_loss: 0.042090, attention_score_distillation_loss: 0.000050
loss: 0.004610, lagrangian_loss: 0.057158, attention_score_distillation_loss: 0.000056
----------------------------------------------------------------------
time: 2023-07-19 17:15:45
Evaluating: accuracy: 0.9289, eval_loss: 0.3383, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 47500
lambda_1: -0.2667, lambda_2: 960.0967 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.77 0.7  0.69 0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000010000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.017927, lagrangian_loss: 0.063212, attention_score_distillation_loss: 0.000043
loss: 0.008120, lagrangian_loss: 0.003447, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:17:15
Evaluating: accuracy: 0.9278, eval_loss: 0.3528, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 48000
lambda_1: -0.2235, lambda_2: 971.6016 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000101000000000000000000
0000000000000000010000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.004714, lagrangian_loss: 0.037656, attention_score_distillation_loss: 0.000055
loss: 0.005076, lagrangian_loss: 0.190249, attention_score_distillation_loss: 0.000066
ETA: 1:50:29 | Epoch 22 finished. Took 376.96 seconds.
----------------------------------------------------------------------
time: 2023-07-19 17:18:45
Evaluating: accuracy: 0.9289, eval_loss: 0.3311, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 48500
lambda_1: -0.1407, lambda_2: 982.6774 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.78 0.7  0.7  0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.003278, lagrangian_loss: 0.023118, attention_score_distillation_loss: 0.000048
loss: 0.010283, lagrangian_loss: 0.086443, attention_score_distillation_loss: 0.000060
----------------------------------------------------------------------
time: 2023-07-19 17:20:15
Evaluating: accuracy: 0.9232, eval_loss: 0.3612, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 49000
lambda_1: -0.3675, lambda_2: 995.7005 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.77 0.7  0.7  0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000100000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009656, lagrangian_loss: 0.009961, attention_score_distillation_loss: 0.000056
loss: 0.005070, lagrangian_loss: 0.001369, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:21:44
Evaluating: accuracy: 0.9232, eval_loss: 0.3452, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 49500
lambda_1: -0.4518, lambda_2: 1006.0292 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.69 0.69 0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.023249, lagrangian_loss: 0.011005, attention_score_distillation_loss: 0.000053
loss: 0.009364, lagrangian_loss: 0.001575, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:23:13
Evaluating: accuracy: 0.9232, eval_loss: 0.3698, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 50000
lambda_1: -0.2274, lambda_2: 1017.4321 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.77 0.69 0.69 0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000101000000000000000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.126595, lagrangian_loss: 0.004380, attention_score_distillation_loss: 0.000053
loss: 0.235262, lagrangian_loss: 0.000166, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 17:24:41
Evaluating: accuracy: 0.9186, eval_loss: 0.3889, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 50500
lambda_1: -0.1758, lambda_2: 1028.5825 lambda_3: 0.0000
train remain: [1.   1.   0.99 0.79 0.76 0.7  0.7  0.52 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0001100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.221742, lagrangian_loss: 0.034486, attention_score_distillation_loss: 0.000043
ETA: 1:43:51 | Epoch 23 finished. Took 378.66 seconds.
loss: 0.008768, lagrangian_loss: 0.084534, attention_score_distillation_loss: 0.000046
----------------------------------------------------------------------
time: 2023-07-19 17:26:10
Evaluating: accuracy: 0.9232, eval_loss: 0.3731, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 51000
lambda_1: -0.1421, lambda_2: 1039.8231 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.75 0.69 0.7  0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.010449, lagrangian_loss: 0.033116, attention_score_distillation_loss: 0.000046
loss: 0.004404, lagrangian_loss: 0.003113, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 17:27:38
Evaluating: accuracy: 0.9278, eval_loss: 0.344, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 51500
lambda_1: -0.0429, lambda_2: 1051.0416 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.75 0.7  0.7  0.53 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007905, lagrangian_loss: 0.015633, attention_score_distillation_loss: 0.000057
loss: 0.007022, lagrangian_loss: 0.001843, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 17:29:07
Evaluating: accuracy: 0.922, eval_loss: 0.3722, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 52000
lambda_1: -0.2290, lambda_2: 1062.3442 lambda_3: 0.0000
train remain: [1.   1.   1.   0.81 0.76 0.7  0.7  0.53 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.004420, lagrangian_loss: 0.000912, attention_score_distillation_loss: 0.000051
loss: 0.006146, lagrangian_loss: 0.110975, attention_score_distillation_loss: 0.000039
----------------------------------------------------------------------
time: 2023-07-19 17:30:34
Evaluating: accuracy: 0.9243, eval_loss: 0.3804, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 52500
lambda_1: -0.2784, lambda_2: 1073.5000 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.75 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100010000000000000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.297986, lagrangian_loss: 0.010362, attention_score_distillation_loss: 0.000058
ETA: 1:37:10 | Epoch 24 finished. Took 370.32 seconds.
loss: 0.014081, lagrangian_loss: 0.103171, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 17:32:01
Evaluating: accuracy: 0.9243, eval_loss: 0.361, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 53000
lambda_1: 0.0155, lambda_2: 1085.3218 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.76 0.7  0.7  0.54 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000010000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005242, lagrangian_loss: 0.003217, attention_score_distillation_loss: 0.000059
loss: 0.047305, lagrangian_loss: 0.103842, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 17:33:27
Evaluating: accuracy: 0.9186, eval_loss: 0.3761, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 53500
lambda_1: -0.0985, lambda_2: 1097.8932 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.76 0.7  0.7  0.53 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000001000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005138, lagrangian_loss: 0.311181, attention_score_distillation_loss: 0.000039
loss: 0.004399, lagrangian_loss: 0.045660, attention_score_distillation_loss: 0.000056
----------------------------------------------------------------------
time: 2023-07-19 17:34:54
Evaluating: accuracy: 0.9163, eval_loss: 0.4068, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 54000
lambda_1: -0.0808, lambda_2: 1108.8696 lambda_3: 0.0000
train remain: [1.   1.   1.   0.8  0.76 0.7  0.7  0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000010000000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005563, lagrangian_loss: 0.000490, attention_score_distillation_loss: 0.000052
loss: 0.002701, lagrangian_loss: 0.000310, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 17:36:21
Evaluating: accuracy: 0.922, eval_loss: 0.3859, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 54500
lambda_1: -0.4706, lambda_2: 1120.7057 lambda_3: 0.0000
train remain: [1.   0.99 1.   0.79 0.77 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.005046, lagrangian_loss: 0.000858, attention_score_distillation_loss: 0.000052
ETA: 1:30:28 | Epoch 25 finished. Took 364.16 seconds.
loss: 0.008167, lagrangian_loss: 0.006928, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 17:37:48
Evaluating: accuracy: 0.9243, eval_loss: 0.3639, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 55000
lambda_1: -0.1750, lambda_2: 1131.4738 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.77 0.7  0.7  0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100001000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.007312, lagrangian_loss: 0.030305, attention_score_distillation_loss: 0.000057
loss: 0.003924, lagrangian_loss: 0.001943, attention_score_distillation_loss: 0.000051
----------------------------------------------------------------------
time: 2023-07-19 17:39:14
Evaluating: accuracy: 0.9232, eval_loss: 0.3751, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 55500
lambda_1: -0.3222, lambda_2: 1143.6475 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.77 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.044854, lagrangian_loss: 0.166533, attention_score_distillation_loss: 0.000061
loss: 0.004780, lagrangian_loss: 0.001529, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 17:40:41
Evaluating: accuracy: 0.9255, eval_loss: 0.3674, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 56000
lambda_1: -0.2564, lambda_2: 1155.0938 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.77 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.006782, lagrangian_loss: 0.052921, attention_score_distillation_loss: 0.000044
loss: 0.004538, lagrangian_loss: 0.004323, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:42:08
Evaluating: accuracy: 0.9266, eval_loss: 0.3523, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 56500
lambda_1: -0.2556, lambda_2: 1167.1124 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.69 0.69 0.53 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.002842, lagrangian_loss: 0.047573, attention_score_distillation_loss: 0.000046
loss: 0.011549, lagrangian_loss: 0.000071, attention_score_distillation_loss: 0.000054
ETA: 1:23:49 | Epoch 26 finished. Took 363.92 seconds.
----------------------------------------------------------------------
time: 2023-07-19 17:43:34
Evaluating: accuracy: 0.9232, eval_loss: 0.3675, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 57000
lambda_1: -0.1855, lambda_2: 1179.0197 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
0000000000001000000000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.003182, lagrangian_loss: 0.004353, attention_score_distillation_loss: 0.000052
loss: 0.005471, lagrangian_loss: 0.041603, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:45:00
Evaluating: accuracy: 0.9266, eval_loss: 0.341, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 57500
lambda_1: -0.3045, lambda_2: 1190.6985 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.69 0.69 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000000000001000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.009677, lagrangian_loss: 0.003945, attention_score_distillation_loss: 0.000049
loss: 0.009485, lagrangian_loss: 0.020115, attention_score_distillation_loss: 0.000051
----------------------------------------------------------------------
time: 2023-07-19 17:46:27
Evaluating: accuracy: 0.9243, eval_loss: 0.3598, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 58000
lambda_1: -0.2953, lambda_2: 1201.7389 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111111010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
loss: 0.003310, lagrangian_loss: 0.022545, attention_score_distillation_loss: 0.000050
loss: 0.004362, lagrangian_loss: 0.040250, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 17:47:53
Evaluating: accuracy: 0.9335, eval_loss: 0.3285, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 58500
lambda_1: -0.2262, lambda_2: 1213.2111 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9289 @ step 28500 epoch 13.54
Saving the best model so far: [Epoch 27 | Step: 58500 | MACs sparsity: 0.4535 | Score: 0.9335 | Loss: 0.3285]
loss: 0.005294, lagrangian_loss: 0.079756, attention_score_distillation_loss: 0.000044
loss: 0.003799, lagrangian_loss: 0.011152, attention_score_distillation_loss: 0.000046
ETA: 1:17:20 | Epoch 27 finished. Took 382.14 seconds.
----------------------------------------------------------------------
time: 2023-07-19 17:49:39
Evaluating: accuracy: 0.9163, eval_loss: 0.3892, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 59000
lambda_1: -0.1040, lambda_2: 1225.4785 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
1000100000000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.030617, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000052
loss: 0.003273, lagrangian_loss: 0.011096, attention_score_distillation_loss: 0.000051
----------------------------------------------------------------------
time: 2023-07-19 17:51:06
Evaluating: accuracy: 0.9232, eval_loss: 0.357, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 59500
lambda_1: -0.1813, lambda_2: 1237.4011 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003005, lagrangian_loss: 0.053603, attention_score_distillation_loss: 0.000040
loss: 0.009924, lagrangian_loss: 0.013898, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 17:52:33
Evaluating: accuracy: 0.9197, eval_loss: 0.3741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 60000
lambda_1: -0.1195, lambda_2: 1248.5291 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.010397, lagrangian_loss: 0.010281, attention_score_distillation_loss: 0.000062
loss: 0.004336, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000056
----------------------------------------------------------------------
time: 2023-07-19 17:53:59
Evaluating: accuracy: 0.9197, eval_loss: 0.3875, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 60500
lambda_1: -0.0746, lambda_2: 1259.7573 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100010000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.008605, lagrangian_loss: 0.146015, attention_score_distillation_loss: 0.000067
loss: 0.014067, lagrangian_loss: 0.003472, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 17:55:25
Evaluating: accuracy: 0.9232, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 61000
lambda_1: 0.0643, lambda_2: 1271.1039 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000000000010000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004650, lagrangian_loss: 0.044687, attention_score_distillation_loss: 0.000059
ETA: 1:10:46 | Epoch 28 finished. Took 367.57 seconds.
loss: 0.004408, lagrangian_loss: 0.054539, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 17:56:53
Evaluating: accuracy: 0.9232, eval_loss: 0.3698, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4432, expected_sequence_sparsity: 0.7856, target_sparsity: 0.4, step: 61500
lambda_1: -0.3601, lambda_2: 1283.4346 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.52, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.13, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006690, lagrangian_loss: 0.037820, attention_score_distillation_loss: 0.000054
loss: 0.006226, lagrangian_loss: 0.014205, attention_score_distillation_loss: 0.000046
----------------------------------------------------------------------
time: 2023-07-19 17:58:19
Evaluating: accuracy: 0.9209, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 62000
lambda_1: -0.0337, lambda_2: 1294.3456 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110101000
0000100000000010000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003804, lagrangian_loss: 0.040909, attention_score_distillation_loss: 0.000059
loss: 0.005209, lagrangian_loss: 0.072344, attention_score_distillation_loss: 0.000061
----------------------------------------------------------------------
time: 2023-07-19 17:59:45
Evaluating: accuracy: 0.9232, eval_loss: 0.3961, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 62500
lambda_1: -0.2674, lambda_2: 1305.4608 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110110000
0000100010000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006811, lagrangian_loss: 0.008639, attention_score_distillation_loss: 0.000052
loss: 0.003622, lagrangian_loss: 0.017137, attention_score_distillation_loss: 0.000049
----------------------------------------------------------------------
time: 2023-07-19 18:01:12
Evaluating: accuracy: 0.9209, eval_loss: 0.3866, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 63000
lambda_1: -0.0440, lambda_2: 1317.6906 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.72 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
1101101111011010110100000
0000101000000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002448, lagrangian_loss: 0.346444, attention_score_distillation_loss: 0.000041
ETA: 1:04:13 | Epoch 29 finished. Took 363.78 seconds.
loss: 0.004151, lagrangian_loss: 0.030700, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 18:02:38
Evaluating: accuracy: 0.9232, eval_loss: 0.3837, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 63500
lambda_1: -0.1709, lambda_2: 1329.0950 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111111010110100000
0000100100000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.005317, lagrangian_loss: 0.055183, attention_score_distillation_loss: 0.000058
loss: 0.004955, lagrangian_loss: 0.108678, attention_score_distillation_loss: 0.000061
----------------------------------------------------------------------
time: 2023-07-19 18:04:04
Evaluating: accuracy: 0.914, eval_loss: 0.4174, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 64000
lambda_1: -0.1744, lambda_2: 1340.0244 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110110000
0000100000000000010000000
0000000000000000000001000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.010350, lagrangian_loss: 0.182796, attention_score_distillation_loss: 0.000065
loss: 0.002113, lagrangian_loss: 0.012199, attention_score_distillation_loss: 0.000054
----------------------------------------------------------------------
time: 2023-07-19 18:05:30
Evaluating: accuracy: 0.9232, eval_loss: 0.3823, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 64500
lambda_1: -0.2126, lambda_2: 1352.1821 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
1101101111011010110100000
0000100000000000001000000
0000000010000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002971, lagrangian_loss: 0.083132, attention_score_distillation_loss: 0.000058
loss: 0.006463, lagrangian_loss: 0.386724, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 18:06:57
Evaluating: accuracy: 0.9186, eval_loss: 0.3901, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 65000
lambda_1: -0.2470, lambda_2: 1364.3120 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011110110100000
0010100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.007009, lagrangian_loss: 0.021756, attention_score_distillation_loss: 0.000048
loss: 0.005223, lagrangian_loss: 0.001395, attention_score_distillation_loss: 0.000049
ETA: 0:57:41 | Epoch 30 finished. Took 362.16 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:08:23
Evaluating: accuracy: 0.9186, eval_loss: 0.3888, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 65500
lambda_1: -0.2038, lambda_2: 1376.3102 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110100100
0000100000100000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.010341, lagrangian_loss: 0.001311, attention_score_distillation_loss: 0.000050
loss: 0.001989, lagrangian_loss: 0.209403, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 18:09:50
Evaluating: accuracy: 0.9186, eval_loss: 0.4016, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 66000
lambda_1: -0.2289, lambda_2: 1387.0728 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.74 0.69 0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110100100
0000100001000000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002681, lagrangian_loss: 0.101971, attention_score_distillation_loss: 0.000060
loss: 0.004634, lagrangian_loss: 0.004062, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 18:11:16
Evaluating: accuracy: 0.9266, eval_loss: 0.3666, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 66500
lambda_1: -0.0507, lambda_2: 1399.0841 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010111100000
0000100000000000100000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002087, lagrangian_loss: 0.129615, attention_score_distillation_loss: 0.000044
loss: 0.004971, lagrangian_loss: 0.000881, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:12:43
Evaluating: accuracy: 0.9278, eval_loss: 0.356, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 67000
lambda_1: 0.0116, lambda_2: 1411.0002 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010111100000
1000100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.008002, lagrangian_loss: 0.002378, attention_score_distillation_loss: 0.000054
loss: 0.010319, lagrangian_loss: 0.042074, attention_score_distillation_loss: 0.000045
ETA: 0:51:11 | Epoch 31 finished. Took 363.44 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:14:09
Evaluating: accuracy: 0.922, eval_loss: 0.3839, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 67500
lambda_1: -0.1236, lambda_2: 1422.6659 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006457, lagrangian_loss: 0.036834, attention_score_distillation_loss: 0.000053
loss: 0.013327, lagrangian_loss: 0.061022, attention_score_distillation_loss: 0.000056
----------------------------------------------------------------------
time: 2023-07-19 18:15:36
Evaluating: accuracy: 0.9197, eval_loss: 0.3936, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4404, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 68000
lambda_1: 0.0118, lambda_2: 1433.4221 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.72 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.08]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111011010110101000
0000100000000000001000000
0000000000000000100000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.010778, lagrangian_loss: 0.009595, attention_score_distillation_loss: 0.000049
loss: 0.004655, lagrangian_loss: 0.064071, attention_score_distillation_loss: 0.000043
----------------------------------------------------------------------
time: 2023-07-19 18:17:02
Evaluating: accuracy: 0.9186, eval_loss: 0.3921, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4405, expected_sequence_sparsity: 0.7845, target_sparsity: 0.4, step: 68500
lambda_1: 0.0033, lambda_2: 1443.8096 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.72 0.55 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.72, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.27, 0.15, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111111111111010111010000
0101101111111010110100000
0001100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006267, lagrangian_loss: 0.000252, attention_score_distillation_loss: 0.000056
loss: 0.002737, lagrangian_loss: 0.000134, attention_score_distillation_loss: 0.000049
----------------------------------------------------------------------
time: 2023-07-19 18:18:29
Evaluating: accuracy: 0.922, eval_loss: 0.3809, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 69000
lambda_1: -0.1860, lambda_2: 1456.8696 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111111010110100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003123, lagrangian_loss: 0.007215, attention_score_distillation_loss: 0.000048
loss: 0.004314, lagrangian_loss: 0.018547, attention_score_distillation_loss: 0.000047
ETA: 0:44:42 | Epoch 32 finished. Took 362.69 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:19:55
Evaluating: accuracy: 0.9278, eval_loss: 0.3581, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 69500
lambda_1: -0.1446, lambda_2: 1467.4972 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.009108, lagrangian_loss: 0.016024, attention_score_distillation_loss: 0.000049
loss: 0.003129, lagrangian_loss: 0.013746, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 18:21:21
Evaluating: accuracy: 0.9278, eval_loss: 0.3482, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 70000
lambda_1: -0.2399, lambda_2: 1479.5732 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004541, lagrangian_loss: 0.023339, attention_score_distillation_loss: 0.000053
loss: 0.005145, lagrangian_loss: 0.000003, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:22:48
Evaluating: accuracy: 0.9255, eval_loss: 0.3602, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 70500
lambda_1: -0.2053, lambda_2: 1491.0387 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100100000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.007048, lagrangian_loss: 0.233375, attention_score_distillation_loss: 0.000066
loss: 0.008428, lagrangian_loss: 0.093717, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 18:24:13
Evaluating: accuracy: 0.9174, eval_loss: 0.4037, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 71000
lambda_1: -0.2005, lambda_2: 1503.3418 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000000000100000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.008609, lagrangian_loss: 0.084994, attention_score_distillation_loss: 0.000053
loss: 0.004410, lagrangian_loss: 0.015736, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 18:25:40
Evaluating: accuracy: 0.9163, eval_loss: 0.4026, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 71500
lambda_1: -0.1630, lambda_2: 1515.7715 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100100000000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003061, lagrangian_loss: 0.190803, attention_score_distillation_loss: 0.000062
ETA: 0:38:16 | Epoch 33 finished. Took 366.76 seconds.
loss: 0.004931, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.000046
----------------------------------------------------------------------
time: 2023-07-19 18:27:06
Evaluating: accuracy: 0.9232, eval_loss: 0.3828, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 72000
lambda_1: -0.0772, lambda_2: 1527.7817 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000010000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.005254, lagrangian_loss: 0.091260, attention_score_distillation_loss: 0.000059
loss: 0.003216, lagrangian_loss: 0.035976, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 18:28:33
Evaluating: accuracy: 0.9174, eval_loss: 0.3958, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 72500
lambda_1: 0.0546, lambda_2: 1539.7094 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.76 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001345, lagrangian_loss: 0.017621, attention_score_distillation_loss: 0.000053
loss: 0.001312, lagrangian_loss: 0.016323, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:29:59
Evaluating: accuracy: 0.9186, eval_loss: 0.391, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 73000
lambda_1: -0.3442, lambda_2: 1551.1321 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.69 0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004728, lagrangian_loss: 0.036811, attention_score_distillation_loss: 0.000044
loss: 0.010639, lagrangian_loss: 0.007627, attention_score_distillation_loss: 0.000047
----------------------------------------------------------------------
time: 2023-07-19 18:31:25
Evaluating: accuracy: 0.9174, eval_loss: 0.4025, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 73500
lambda_1: -0.2420, lambda_2: 1562.0221 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.69 0.71 0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001686, lagrangian_loss: 0.021263, attention_score_distillation_loss: 0.000052
ETA: 0:31:51 | Epoch 34 finished. Took 362.62 seconds.
loss: 0.003135, lagrangian_loss: 0.139665, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 18:32:52
Evaluating: accuracy: 0.9186, eval_loss: 0.3941, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 74000
lambda_1: -0.1809, lambda_2: 1573.5197 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.74 0.7  0.71 0.55 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002117, lagrangian_loss: 0.237573, attention_score_distillation_loss: 0.000040
loss: 0.238044, lagrangian_loss: 0.063505, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:34:19
Evaluating: accuracy: 0.9209, eval_loss: 0.3868, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 74500
lambda_1: -0.1354, lambda_2: 1584.7434 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100010000000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004703, lagrangian_loss: 0.035853, attention_score_distillation_loss: 0.000054
loss: 0.003526, lagrangian_loss: 0.061310, attention_score_distillation_loss: 0.000043
----------------------------------------------------------------------
time: 2023-07-19 18:35:45
Evaluating: accuracy: 0.9174, eval_loss: 0.4025, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 75000
lambda_1: -0.0771, lambda_2: 1595.6375 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000000000010000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003222, lagrangian_loss: 0.010867, attention_score_distillation_loss: 0.000056
loss: 0.004039, lagrangian_loss: 0.151809, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 18:37:11
Evaluating: accuracy: 0.9117, eval_loss: 0.4162, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 75500
lambda_1: -0.3404, lambda_2: 1607.3636 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.69 0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.008898, lagrangian_loss: 0.002641, attention_score_distillation_loss: 0.000056
loss: 0.002288, lagrangian_loss: 0.005810, attention_score_distillation_loss: 0.000047
ETA: 0:25:26 | Epoch 35 finished. Took 362.8 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:38:37
Evaluating: accuracy: 0.9163, eval_loss: 0.3967, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 76000
lambda_1: -0.0507, lambda_2: 1617.9211 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.71 0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0111101111011010110100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004749, lagrangian_loss: 0.004356, attention_score_distillation_loss: 0.000054
loss: 0.005166, lagrangian_loss: 0.071034, attention_score_distillation_loss: 0.000059
----------------------------------------------------------------------
time: 2023-07-19 18:40:04
Evaluating: accuracy: 0.914, eval_loss: 0.4179, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 76500
lambda_1: -0.0346, lambda_2: 1630.5061 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.7  0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110101000
0000101000000000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.107366, lagrangian_loss: 0.038062, attention_score_distillation_loss: 0.000047
loss: 0.002034, lagrangian_loss: 0.000957, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:41:30
Evaluating: accuracy: 0.914, eval_loss: 0.4237, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 77000
lambda_1: -0.2779, lambda_2: 1642.2699 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0000100000000000001000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001993, lagrangian_loss: 0.011436, attention_score_distillation_loss: 0.000049
loss: 0.019683, lagrangian_loss: 0.092736, attention_score_distillation_loss: 0.000063
----------------------------------------------------------------------
time: 2023-07-19 18:42:57
Evaluating: accuracy: 0.9197, eval_loss: 0.3978, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 77500
lambda_1: -0.0523, lambda_2: 1653.0314 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0111101111011010110100000
1000100000000000000000000
0000000000000000000100000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003899, lagrangian_loss: 0.073292, attention_score_distillation_loss: 0.000058
loss: 0.002660, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000047
ETA: 0:19:03 | Epoch 36 finished. Took 362.99 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:44:23
Evaluating: accuracy: 0.922, eval_loss: 0.3961, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 78000
lambda_1: -0.1274, lambda_2: 1664.6094 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.71 0.55 0.09 0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000101000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003181, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000048
loss: 0.002963, lagrangian_loss: 0.032109, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:45:50
Evaluating: accuracy: 0.9209, eval_loss: 0.387, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 78500
lambda_1: -0.3639, lambda_2: 1676.4302 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.69 0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000001000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.005896, lagrangian_loss: 0.111578, attention_score_distillation_loss: 0.000059
loss: 0.012731, lagrangian_loss: 0.001073, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 18:47:16
Evaluating: accuracy: 0.9163, eval_loss: 0.4137, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 79000
lambda_1: -0.0953, lambda_2: 1688.5875 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.55 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003938, lagrangian_loss: 0.001211, attention_score_distillation_loss: 0.000054
loss: 0.004430, lagrangian_loss: 0.008583, attention_score_distillation_loss: 0.000048
----------------------------------------------------------------------
time: 2023-07-19 18:48:42
Evaluating: accuracy: 0.9174, eval_loss: 0.4159, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 79500
lambda_1: -0.1537, lambda_2: 1699.3953 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110110000
0000100000000000100000000
0000000000000000001000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002152, lagrangian_loss: 0.010100, attention_score_distillation_loss: 0.000048
loss: 0.003274, lagrangian_loss: 0.229384, attention_score_distillation_loss: 0.000044
ETA: 0:12:41 | Epoch 37 finished. Took 362.55 seconds.
----------------------------------------------------------------------
time: 2023-07-19 18:50:09
Evaluating: accuracy: 0.9174, eval_loss: 0.416, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 80000
lambda_1: 0.0127, lambda_2: 1710.7234 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.55 0.1  0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
1000100000000000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001956, lagrangian_loss: 0.049769, attention_score_distillation_loss: 0.000057
loss: 0.007824, lagrangian_loss: 0.023596, attention_score_distillation_loss: 0.000051
----------------------------------------------------------------------
time: 2023-07-19 18:51:36
Evaluating: accuracy: 0.9186, eval_loss: 0.411, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 80500
lambda_1: -0.0600, lambda_2: 1724.6722 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111111010110100000
0000100000000000100000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.013019, lagrangian_loss: 0.202098, attention_score_distillation_loss: 0.000059
loss: 0.002397, lagrangian_loss: 0.062876, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:53:03
Evaluating: accuracy: 0.9163, eval_loss: 0.4202, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 81000
lambda_1: -0.0943, lambda_2: 1735.4951 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
1000100000000000000000000
0010000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004799, lagrangian_loss: 0.003932, attention_score_distillation_loss: 0.000053
loss: 0.002140, lagrangian_loss: 0.219619, attention_score_distillation_loss: 0.000058
----------------------------------------------------------------------
time: 2023-07-19 18:54:31
Evaluating: accuracy: 0.9186, eval_loss: 0.3913, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 81500
lambda_1: 0.0196, lambda_2: 1746.4094 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.75 0.7  0.7  0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004118, lagrangian_loss: 0.097584, attention_score_distillation_loss: 0.000045
loss: 0.004009, lagrangian_loss: 0.012719, attention_score_distillation_loss: 0.000052
----------------------------------------------------------------------
time: 2023-07-19 18:55:59
Evaluating: accuracy: 0.9186, eval_loss: 0.4033, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 82000
lambda_1: -0.0256, lambda_2: 1756.4104 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.75 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
0000100000001000000000000
1000000000000000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.004811, lagrangian_loss: 0.020769, attention_score_distillation_loss: 0.000056
ETA: 0:06:20 | Epoch 38 finished. Took 372.34 seconds.
loss: 0.006787, lagrangian_loss: 0.000367, attention_score_distillation_loss: 0.000057
----------------------------------------------------------------------
time: 2023-07-19 18:57:27
Evaluating: accuracy: 0.9174, eval_loss: 0.4066, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 82500
lambda_1: -0.1395, lambda_2: 1767.7959 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010110100010
0000100000001000000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.002873, lagrangian_loss: 0.066111, attention_score_distillation_loss: 0.000048
loss: 0.001274, lagrangian_loss: 0.008245, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 18:58:55
Evaluating: accuracy: 0.9174, eval_loss: 0.4043, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 83000
lambda_1: -0.0655, lambda_2: 1778.9573 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.74 0.7  0.7  0.55 0.1  0.06]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0001100000000000000000000
0000000000000000100000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.006256, lagrangian_loss: 0.026077, attention_score_distillation_loss: 0.000057
loss: 0.006882, lagrangian_loss: 0.016961, attention_score_distillation_loss: 0.000053
----------------------------------------------------------------------
time: 2023-07-19 19:00:24
Evaluating: accuracy: 0.9174, eval_loss: 0.4029, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 83500
lambda_1: -0.1752, lambda_2: 1789.7844 lambda_3: 0.0000
train remain: [1.   1.   1.   0.78 0.74 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
0101101111011010111100000
1000100000000000000000000
0000000000000000000000001
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.003101, lagrangian_loss: 0.155395, attention_score_distillation_loss: 0.000056
loss: 0.003904, lagrangian_loss: 0.006937, attention_score_distillation_loss: 0.000055
----------------------------------------------------------------------
time: 2023-07-19 19:01:53
Evaluating: accuracy: 0.9174, eval_loss: 0.4077, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4535, expected_sparsity: 0.4423, expected_sequence_sparsity: 0.7852, target_sparsity: 0.4, step: 84000
lambda_1: -0.0343, lambda_2: 1800.2780 lambda_3: 0.0000
train remain: [1.   1.   1.   0.79 0.74 0.7  0.7  0.54 0.09 0.05]
infer remain: [1.0, 1.0, 1.0, 0.76, 0.72, 0.68, 0.68, 0.56, 0.08, 0.04]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.55, 0.37, 0.25, 0.14, 0.01, 0.0]
1111111111111111111111111
1111111111111111111111111
1111111111111111111111111
1111111111111110101101000
1111111111111110101100000
1111111111011110101100000
1111101111111010111010000
1101101111011010110100000
0000100000000010000000000
0000000000001000000000000
Best eval score so far: 0.9335 @ step 58500 epoch 27.79
loss: 0.001714, lagrangian_loss: 0.187577, attention_score_distillation_loss: 0.000062
ETA: 0:00:00 | Epoch 39 finished. Took 371.54 seconds.
07/19/2023 19:04:32 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=abaf7266-b685-4ed2-977e-c3790b442fc2&run_id=abaf7266-b685-4ed2-977e-c3790b442fc2