/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:34:09 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:34:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:34:09 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/qqp to /home/aiscuser/.cache/huggingface/datasets/glue/qqp/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=3000,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50/runs/Jul19_14-34-10_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/QQP/reproduce1/s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.65_lr2e-05_reglr0.01_alpha0.0001_warmup10_bin50', pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.65, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/QQP', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.0001, distill_temp=2.0, use_mac_l0=True, prune_location=[3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20)
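Note on the schedule: the target_sparsity values in the evaluation blocks below ramp linearly toward the configured target_sparsity=0.65 over lagrangian_warmup_epochs=10. A minimal sketch of that schedule (the function name and step counts are illustrative, not this repo's API):

def warmup_target_sparsity(step: int, warmup_steps: int, final_sparsity: float = 0.65) -> float:
    """Linearly anneal the sparsity target from 0 to final_sparsity over warmup_steps."""
    return final_sparsity * min(1.0, step / warmup_steps)

# QQP has ~364k train pairs, so batch size 32 gives ~11,371 steps/epoch and a
# 10-epoch warmup of ~113,710 steps. That reproduces the logged targets:
print(round(warmup_target_sparsity(3000, 113710), 4))  # ~0.0171, as at step 3000
print(round(warmup_target_sparsity(6000, 113710), 4))  # ~0.0343, as at step 6000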
----------------------------------------------------------------------
time: 2023-07-19 14:41:28
Evaluating: accuracy: 0.912, eval_loss: 0.4066, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! temperature: 0.67, init drop rate: 0.01
token_loga shape: [9, 50]
prune location: [3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.026014, lagrangian_loss: -0.002583, attention_score_distillation_loss: 0.000970
----------------------------------------------------------------------
time: 2023-07-19 14:55:50
Evaluating: accuracy: 0.9052, eval_loss: 0.4611, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.766, target_sparsity: 0.0171, step: 3000
lambda_1: 0.7990, lambda_2: 36.6934 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.99 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
loss: 0.051543, lagrangian_loss: -0.004071, attention_score_distillation_loss: 0.000949
loss: 0.020752, lagrangian_loss: 0.005120, attention_score_distillation_loss: 0.000937
----------------------------------------------------------------------
time: 2023-07-19 15:10:12
Evaluating: accuracy: 0.903, eval_loss: 0.4714, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0278, expected_sparsity: 0.0241, expected_sequence_sparsity: 0.7716, target_sparsity: 0.0343, step: 6000
lambda_1: -2.4793, lambda_2: 48.1642 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.99 1. 0.83]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111101111101110111100100
loss: 0.108954, lagrangian_loss: 0.001946, attention_score_distillation_loss: 0.000925
loss: 0.363066, lagrangian_loss: -0.000205, attention_score_distillation_loss: 0.000915
----------------------------------------------------------------------
time: 2023-07-19 15:24:35
Evaluating: accuracy: 0.904, eval_loss: 0.415, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.037, expected_sparsity: 0.0348, expected_sequence_sparsity: 0.7742, target_sparsity: 0.0514, step: 9000
lambda_1: 0.5591, lambda_2: 56.8841 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.98 0.98 0.74]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
00101111111111111111111110111101111101100111000100
loss: 0.146176, lagrangian_loss: -0.000136, attention_score_distillation_loss: 0.000914
loss: 0.598854, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000880
ETA: 1 day, 9:37:52 | Epoch 0 finished. Took 3104.41 seconds.
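Background on the gates: the "Starting l0 regularization!" banner above refers to a hard-concrete-style L0 relaxation, with one learnable log-alpha per (pruned layer, position bin), i.e. the [9, 50] token_loga tensor, sampled stochastically during training and thresholded deterministically at eval time. A sketch of such a deterministic eval-time gate under the usual stretch constants from Louizos et al. (assumed here, not extracted from this repo):

import torch

GAMMA, ZETA = -0.1, 1.1  # stretch interval of the hard concrete distribution (assumed)

def eval_gates(token_loga: torch.Tensor) -> torch.Tensor:
    # Deterministic gate: stretch sigmoid(log_alpha) to [GAMMA, ZETA], clamp to [0, 1].
    return (torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

token_loga = torch.zeros(9, 50)        # [pruned layers 3..11, bin_num=50]
keep_mask = (eval_gates(token_loga) > 0).float()
infer_remain = keep_mask.mean(dim=1)   # per-layer keep fraction, cf. "infer remain" below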
----------------------------------------------------------------------
time: 2023-07-19 15:38:59
Evaluating: accuracy: 0.9067, eval_loss: 0.4672, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0723, expected_sparsity: 0.0676, expected_sequence_sparsity: 0.7819, target_sparsity: 0.0686, step: 12000
lambda_1: 0.6324, lambda_2: 74.4152 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.96 0.95 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.6]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101111111111111111101111111110
00101111111111111111111110111101001101100011000100
loss: 0.251421, lagrangian_loss: -0.000140, attention_score_distillation_loss: 0.000883
loss: 0.280520, lagrangian_loss: 0.001688, attention_score_distillation_loss: 0.000867
----------------------------------------------------------------------
time: 2023-07-19 15:53:21
Evaluating: accuracy: 0.9063, eval_loss: 0.4611, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0843, expected_sparsity: 0.0754, expected_sequence_sparsity: 0.7837, target_sparsity: 0.0857, step: 15000
lambda_1: -2.0802, lambda_2: 88.1947 lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 0.99 0.99 0.94 0.93 0.65]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.92, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.86, 0.55]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111111101111111110
00101111111111111111111110111101001101100010000000
loss: 0.314539, lagrangian_loss: 0.002501, attention_score_distillation_loss: 0.000848
loss: 0.739795, lagrangian_loss: -0.000702, attention_score_distillation_loss: 0.000843
----------------------------------------------------------------------
time: 2023-07-19 16:07:40
Evaluating: accuracy: 0.9039, eval_loss: 0.4471, token_prune_loc: [False, False, False, False, False, False, True, True, True], macs_sparsity: 0.0843, expected_sparsity: 0.0808, expected_sequence_sparsity: 0.785, target_sparsity: 0.1029, step: 18000
lambda_1: -2.2684, lambda_2: 106.6377 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 0.98 0.99 0.94 0.91 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.9, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.85, 0.52]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111110101111111110
00101111111111111111111110111101000101100010000000
loss: 0.108628, lagrangian_loss: 0.007192, attention_score_distillation_loss: 0.000823
loss: 0.436417, lagrangian_loss: -0.000340, attention_score_distillation_loss: 0.000821
----------------------------------------------------------------------
time: 2023-07-19 16:22:04
Evaluating: accuracy: 0.9046, eval_loss: 0.4555, token_prune_loc: [False, False, False, False, True, False, True, True, True], macs_sparsity: 0.1243, expected_sparsity: 0.1137, expected_sequence_sparsity: 0.7927, target_sparsity: 0.12, step: 21000
lambda_1: -1.4880, lambda_2: 123.1607 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 0.95 0.99 0.94 0.88 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 0.94, 0.88, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.86, 0.76, 0.47]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111011100
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111111111110
11111111111111111111101110111111111111101011111100
00101111111111111111111110111100000101100010000001
loss: 0.019959, lagrangian_loss: -0.001502, attention_score_distillation_loss: 0.000807
loss: 0.117746, lagrangian_loss: -0.000817, attention_score_distillation_loss: 0.000782
ETA: 1 day, 9:51:18 | Epoch 1 finished. Took 3310.24 seconds.
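Background on the penalty: lambda_1 and lambda_2 are the multipliers of the Lagrangian constraint that pushes expected_sparsity toward target_sparsity; the small positive and negative lagrangian_loss values in these blocks come from the gap between the two. A sketch of the usual CoFi-style form of this term (assumed here, not extracted from the code):

def lagrangian_term(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    # First- plus second-order penalty on the sparsity gap; the multipliers are
    # trained adversarially (gradient ascent), which is why they drift in the log.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2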
----------------------------------------------------------------------
time: 2023-07-19 16:36:26
Evaluating: accuracy: 0.9047, eval_loss: 0.4748, token_prune_loc: [False, False, False, False, True, False, True, True, True], macs_sparsity: 0.1345, expected_sparsity: 0.1298, expected_sequence_sparsity: 0.7965, target_sparsity: 0.1372, step: 24000
lambda_1: -1.3561, lambda_2: 136.4560 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.98 0.93 0.98 0.93 0.87 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.92, 0.86, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.83, 0.71, 0.43]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111010100
11111111111111111111111111111111111111111111111111
10111111111111111111111110111111111111111110111110
11111111111111111111101110111111111110101011111100
00101111111111111111111110111100000101100010000000
loss: 0.153738, lagrangian_loss: 0.000808, attention_score_distillation_loss: 0.000781
loss: 0.031842, lagrangian_loss: -0.001276, attention_score_distillation_loss: 0.000768
----------------------------------------------------------------------
time: 2023-07-19 16:50:49
Evaluating: accuracy: 0.902, eval_loss: 0.5116, token_prune_loc: [False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1558, expected_sparsity: 0.1487, expected_sequence_sparsity: 0.801, target_sparsity: 0.1543, step: 27000
lambda_1: -0.0448, lambda_2: 151.1529 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.98 0.91 0.96 0.92 0.86 0.59]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.94, 0.9, 0.86, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.76, 0.65, 0.39]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111111111111111111010100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110101110
11111111111111111111101110111111111110101011111100
00101111111111111111111110111100000100101010000000
loss: 0.524790, lagrangian_loss: 0.000024, attention_score_distillation_loss: 0.000743
loss: 0.263436, lagrangian_loss: 0.000239, attention_score_distillation_loss: 0.000728
----------------------------------------------------------------------
time: 2023-07-19 17:05:15
Evaluating: accuracy: 0.9038, eval_loss: 0.4867, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1957, expected_sparsity: 0.1855, expected_sequence_sparsity: 0.8096, target_sparsity: 0.1715, step: 30000
lambda_1: -1.4335, lambda_2: 172.2853 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.96 0.9 0.96 0.89 0.85 0.58]
infer remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.94, 0.88, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.83, 0.78, 0.68, 0.57, 0.33]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110011110
11111111111111111111111110111111111111111111000100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110101100
11111111111111111111101110111111111110101011110100
00101111111111111111111110111100000100100010000000
loss: 0.208901, lagrangian_loss: -0.001067, attention_score_distillation_loss: 0.000725
loss: 0.145273, lagrangian_loss: 0.001229, attention_score_distillation_loss: 0.000702
----------------------------------------------------------------------
time: 2023-07-19 17:19:40
Evaluating: accuracy: 0.9055, eval_loss: 0.4734, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.217, expected_sparsity: 0.2022, expected_sequence_sparsity: 0.8135, target_sparsity: 0.1886, step: 33000
lambda_1: -0.5405, lambda_2: 187.6610 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.95 0.88 0.96 0.86 0.86 0.58]
infer remain: [1.0, 1.0, 1.0, 0.92, 0.86, 0.94, 0.86, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.79, 0.74, 0.64, 0.54, 0.31]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001110
11111111111111111111111110111111111111111011000100
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111111111110100100
11111111111111111111101110111111111110101011110100
10101111111111111111111110110100000100100010000000
loss: 0.636113, lagrangian_loss: -0.000260, attention_score_distillation_loss: 0.000695
ETA: 1 day, 9:21:26 | Epoch 2 finished. Took 3322.07 seconds.
loss: 0.011457, lagrangian_loss: 0.000860, attention_score_distillation_loss: 0.000686
----------------------------------------------------------------------
time: 2023-07-19 17:34:04
Evaluating: accuracy: 0.905, eval_loss: 0.4898, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2226, expected_sparsity: 0.2113, expected_sequence_sparsity: 0.8157, target_sparsity: 0.2058, step: 36000
lambda_1: -0.8275, lambda_2: 209.8195 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.93 0.86 0.96 0.85 0.85 0.58]
infer remain: [1.0, 1.0, 1.0, 0.92, 0.84, 0.94, 0.84, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.77, 0.73, 0.61, 0.51, 0.3]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001110
11111111111111111111111110111111111111111011000000
11111111111111111111111111111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100010100000
loss: 0.038668, lagrangian_loss: 0.000550, attention_score_distillation_loss: 0.000674
loss: 0.019112, lagrangian_loss: -0.000130, attention_score_distillation_loss: 0.000661
----------------------------------------------------------------------
time: 2023-07-19 17:48:28
Evaluating: accuracy: 0.9035, eval_loss: 0.5023, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2365, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8196, target_sparsity: 0.2229, step: 39000
lambda_1: -1.4702, lambda_2: 219.3213 lambda_3: 0.0000
train remain: [0.98 1. 1. 0.91 0.84 0.94 0.86 0.85 0.57]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.82, 0.92, 0.84, 0.84, 0.58]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.74, 0.68, 0.57, 0.48, 0.28]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001100
11111111111111111111111110111111111111111010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100001100100010000000
loss: 0.051066, lagrangian_loss: -0.000569, attention_score_distillation_loss: 0.000648
loss: 0.162792, lagrangian_loss: -0.000987, attention_score_distillation_loss: 0.000631
----------------------------------------------------------------------
time: 2023-07-19 18:02:45
Evaluating: accuracy: 0.9057, eval_loss: 0.4993, token_prune_loc: [False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2523, expected_sparsity: 0.2417, expected_sequence_sparsity: 0.8228, target_sparsity: 0.2401, step: 42000
lambda_1: -2.3362, lambda_2: 238.0361 lambda_3: 0.0000
train remain: [0.98 1. 0.99 0.9 0.82 0.93 0.85 0.85 0.56]
infer remain: [1.0, 1.0, 1.0, 0.88, 0.8, 0.92, 0.84, 0.84, 0.56]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.7, 0.65, 0.54, 0.46, 0.26]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111111110010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000001
loss: 0.134315, lagrangian_loss: -0.002282, attention_score_distillation_loss: 0.000621
loss: 0.673241, lagrangian_loss: 0.001305, attention_score_distillation_loss: 0.000607
----------------------------------------------------------------------
time: 2023-07-19 18:17:01
Evaluating: accuracy: 0.9045, eval_loss: 0.4668, token_prune_loc: [True, False, False, True, True, True, True, True, True], macs_sparsity: 0.2876, expected_sparsity: 0.2697, expected_sequence_sparsity: 0.8294, target_sparsity: 0.2572, step: 45000
lambda_1: -1.0491, lambda_2: 256.7698 lambda_3: 0.0000
train remain: [0.97 1. 0.99 0.89 0.79 0.93 0.85 0.85 0.55]
infer remain: [0.96, 1.0, 1.0, 0.88, 0.78, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.96, 0.84, 0.66, 0.61, 0.51, 0.43, 0.23]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111111010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000000
loss: 0.053032, lagrangian_loss: 0.000619, attention_score_distillation_loss: 0.000598
ETA: 1 day, 8:35:35 | Epoch 3 finished. Took 3300.53 seconds.
loss: 0.674372, lagrangian_loss: 0.000077, attention_score_distillation_loss: 0.000574
----------------------------------------------------------------------
time: 2023-07-19 18:31:17
Evaluating: accuracy: 0.9051, eval_loss: 0.491, token_prune_loc: [True, False, False, True, True, True, True, True, True], macs_sparsity: 0.2903, expected_sparsity: 0.275, expected_sequence_sparsity: 0.8306, target_sparsity: 0.2744, step: 48000
lambda_1: -1.7333, lambda_2: 272.6484 lambda_3: 0.0000
train remain: [0.97 1. 0.98 0.88 0.77 0.93 0.84 0.85 0.54]
infer remain: [0.96, 1.0, 1.0, 0.88, 0.76, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.96, 0.84, 0.64, 0.59, 0.5, 0.42, 0.23]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110000100
11111111111111111111111110111111111011010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110110100000100100000000000
loss: 0.041096, lagrangian_loss: 0.001828, attention_score_distillation_loss: 0.000557
loss: 0.590846, lagrangian_loss: -0.001445, attention_score_distillation_loss: 0.000560
----------------------------------------------------------------------
time: 2023-07-19 18:45:34
Evaluating: accuracy: 0.9051, eval_loss: 0.4827, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3126, expected_sparsity: 0.3016, expected_sequence_sparsity: 0.8369, target_sparsity: 0.2915, step: 51000
lambda_1: -1.8712, lambda_2: 290.0020 lambda_3: 0.0000
train remain: [0.97 0.99 0.95 0.86 0.77 0.92 0.84 0.84 0.54]
infer remain: [0.96, 1.0, 0.94, 0.86, 0.76, 0.92, 0.84, 0.84, 0.54]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.9, 0.78, 0.59, 0.54, 0.46, 0.38, 0.21]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110111100
11111111111111111111111111111111111111110110000100
11111111111111111111111110111111111011010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100100
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100001100100000000000
loss: 0.387114, lagrangian_loss: -0.001067, attention_score_distillation_loss: 0.000547
loss: 0.360754, lagrangian_loss: 0.003099, attention_score_distillation_loss: 0.000520
----------------------------------------------------------------------
time: 2023-07-19 18:59:58
Evaluating: accuracy: 0.901, eval_loss: 0.4962, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3395, expected_sparsity: 0.3218, expected_sequence_sparsity: 0.8417, target_sparsity: 0.3087, step: 54000
lambda_1: -0.3097, lambda_2: 309.7510 lambda_3: 0.0000
train remain: [0.97 0.99 0.94 0.85 0.75 0.91 0.83 0.84 0.53]
infer remain: [0.96, 1.0, 0.92, 0.84, 0.74, 0.92, 0.82, 0.84, 0.52]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.74, 0.55, 0.51, 0.41, 0.35, 0.18]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110101100
11111111111111111111111111111111111111110110000000
11111111111111111111111110111111111010010010000000
11111111111111111111111110111111111111111111101100
10111111111111111111111110111111111011111110100000
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100000100100000000000
loss: 0.595147, lagrangian_loss: 0.000261, attention_score_distillation_loss: 0.000511
loss: 0.314396, lagrangian_loss: -0.000247, attention_score_distillation_loss: 0.000506
ETA: 1 day, 7:22:20 | Epoch 4 finished. Took 3097.1 seconds.
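Note on "layerwise remain": it is the running product of "infer remain" across the pruned layers, since a token bin dropped at layer l never reaches layer l+1; layers 0-2 sit outside prune_location, hence the leading 1.0 entries. Reproducing the step-54000 block above:

from itertools import accumulate
from operator import mul

infer_remain = [0.96, 1.0, 0.92, 0.84, 0.74, 0.92, 0.82, 0.84, 0.52]  # layers 3..11
layerwise = [1.0, 1.0, 1.0] + list(accumulate(infer_remain, mul))
print([round(x, 2) for x in layerwise])
# [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.74, 0.55, 0.51, 0.41, 0.35, 0.18] -- as logged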
----------------------------------------------------------------------
time: 2023-07-19 19:14:26
Evaluating: accuracy: 0.898, eval_loss: 0.5246, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3524, expected_sparsity: 0.3308, expected_sequence_sparsity: 0.8438, target_sparsity: 0.3258, step: 57000
lambda_1: -3.3939, lambda_2: 326.0374 lambda_3: 0.0000
train remain: [0.97 0.99 0.92 0.82 0.75 0.9 0.83 0.84 0.51]
infer remain: [0.96, 1.0, 0.92, 0.82, 0.74, 0.9, 0.82, 0.84, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.88, 0.72, 0.54, 0.48, 0.4, 0.33, 0.17]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110101100
11111111111111111111111111111111111111100110000000
11111111111111111111111110111111111010010010000000
11111111111111111111111110111111111111111111100100
10111111111111111111111110111111111011111110100000
11111111111111111111101110111111111110101011110100
00101111111111111111111110010100000000100000000000
loss: 0.010384, lagrangian_loss: -0.000795, attention_score_distillation_loss: 0.000491
loss: 0.819248, lagrangian_loss: 0.000366, attention_score_distillation_loss: 0.000479
----------------------------------------------------------------------
time: 2023-07-19 19:28:55
Evaluating: accuracy: 0.8998, eval_loss: 0.5425, token_prune_loc: [True, False, True, True, True, True, True, True, True], macs_sparsity: 0.3635, expected_sparsity: 0.3468, expected_sequence_sparsity: 0.8475, target_sparsity: 0.343, step: 60000
lambda_1: -0.7647, lambda_2: 344.1528 lambda_3: 0.0000
train remain: [0.97 0.99 0.91 0.81 0.75 0.87 0.83 0.81 0.49]
infer remain: [0.96, 1.0, 0.9, 0.8, 0.74, 0.88, 0.82, 0.82, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.86, 0.69, 0.51, 0.45, 0.37, 0.3, 0.15]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111110001100
11111111111111111111111111111111111111100010000000
11111111111111111111111110111111111010010010000000
10111111111111111111111110111111111111111111100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111110101011110100
00101111111111111111111010010100000000100000000000
loss: 0.371095, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.000465
loss: 0.383557, lagrangian_loss: -0.007696, attention_score_distillation_loss: 0.000452
----------------------------------------------------------------------
time: 2023-07-19 19:43:28
Evaluating: accuracy: 0.8984, eval_loss: 0.5257, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3848, expected_sparsity: 0.367, expected_sequence_sparsity: 0.8523, target_sparsity: 0.3601, step: 63000
lambda_1: -2.5047, lambda_2: 366.3772 lambda_3: 0.0000
train remain: [0.97 0.98 0.89 0.79 0.73 0.87 0.82 0.79 0.48]
infer remain: [0.96, 0.98, 0.88, 0.8, 0.72, 0.86, 0.82, 0.8, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.83, 0.66, 0.48, 0.41, 0.34, 0.27, 0.13]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111110001000
11111111111111111111111111111111111111100010000000
11111111111111111111111110111111111010010000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111110101011110000
00111111110111111111111010010100000000100000000000
loss: 0.127202, lagrangian_loss: 0.000266, attention_score_distillation_loss: 0.000439
loss: 0.121989, lagrangian_loss: 0.001641, attention_score_distillation_loss: 0.000421
----------------------------------------------------------------------
time: 2023-07-19 19:57:55
Evaluating: accuracy: 0.8995, eval_loss: 0.4993, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4042, expected_sparsity: 0.3851, expected_sequence_sparsity: 0.8565, target_sparsity: 0.3773, step: 66000
lambda_1: -2.7617, lambda_2: 379.9322 lambda_3: 0.0000
train remain: [0.97 0.98 0.85 0.77 0.72 0.86 0.82 0.76 0.48]
infer remain: [0.96, 0.98, 0.84, 0.78, 0.72, 0.86, 0.82, 0.76, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.62, 0.44, 0.38, 0.31, 0.24, 0.11]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011110001000
11111111111111111111111111111111111111100000000000
11111111111111111111111110111111111010010000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
10111111111111111111101110111111111010001011110000
00111111110111111111111010010100000000100000000000
loss: 0.031277, lagrangian_loss: -0.003024, attention_score_distillation_loss: 0.000413
loss: 0.289318, lagrangian_loss: -0.005275, attention_score_distillation_loss: 0.000401
ETA: 1 day, 6:39:01 | Epoch 5 finished. Took 3337.65 seconds.
----------------------------------------------------------------------
time: 2023-07-19 20:12:22
Evaluating: accuracy: 0.8951, eval_loss: 0.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4144, expected_sparsity: 0.398, expected_sequence_sparsity: 0.8596, target_sparsity: 0.3944, step: 69000
lambda_1: -2.4025, lambda_2: 396.2157 lambda_3: 0.0000
train remain: [0.97 0.98 0.84 0.75 0.69 0.86 0.82 0.73 0.48]
infer remain: [0.96, 0.98, 0.84, 0.76, 0.68, 0.86, 0.82, 0.72, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.6, 0.41, 0.35, 0.29, 0.21, 0.1]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011110001000
11111111111111111111111111111111111111000000000000
10111111111111111111111110111111111010000000000000
10111111111111111111111110111111111111111110100100
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001011110000
00101111110111111111111010010100000000110000000000
loss: 0.346848, lagrangian_loss: 0.008622, attention_score_distillation_loss: 0.000380
loss: 0.120472, lagrangian_loss: 0.007903, attention_score_distillation_loss: 0.000367
----------------------------------------------------------------------
time: 2023-07-19 20:26:48
Evaluating: accuracy: 0.9013, eval_loss: 0.4974, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4255, expected_sparsity: 0.4128, expected_sequence_sparsity: 0.8631, target_sparsity: 0.4116, step: 72000
lambda_1: -6.2552, lambda_2: 414.8770 lambda_3: 0.0000
train remain: [0.96 0.98 0.82 0.73 0.67 0.84 0.82 0.7 0.44]
infer remain: [0.96, 0.98, 0.82, 0.74, 0.68, 0.82, 0.82, 0.7, 0.44]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.77, 0.57, 0.39, 0.32, 0.26, 0.18, 0.08]
11111111111111111111111111111111111111111111111100
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111111111011100001000
11111111111111111111111110111111111111000000000000
10111111111111111111111110111111111010000000000000
10111111111111111111111110111111111111110110100000
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001010110000
00001111110111111111011010010100000000100000000001
loss: 0.267200, lagrangian_loss: 0.009891, attention_score_distillation_loss: 0.000353
loss: 0.012345, lagrangian_loss: -0.000463, attention_score_distillation_loss: 0.000349
----------------------------------------------------------------------
time: 2023-07-19 20:41:21
Evaluating: accuracy: 0.8968, eval_loss: 0.5501, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4551, expected_sparsity: 0.4367, expected_sequence_sparsity: 0.8687, target_sparsity: 0.4287, step: 75000
lambda_1: -4.8097, lambda_2: 432.0788 lambda_3: 0.0000
train remain: [0.96 0.97 0.79 0.73 0.67 0.83 0.82 0.67 0.39]
infer remain: [0.96, 0.96, 0.78, 0.72, 0.66, 0.82, 0.82, 0.68, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.92, 0.72, 0.52, 0.34, 0.28, 0.23, 0.16, 0.06]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111110
11111111111111111111111110111111111111011000000000
11111111111111111111111110111111111110000000000000
10111111111111111111111110111111101010000000000000
10111111111111111111111110111111111111110110100000
10111111111111111111111110111111111011111110100000
00111111111111111111101110111111011010001000110000
00001111110111101111011010010100000000100000000000
loss: 0.197921, lagrangian_loss: 0.009890, attention_score_distillation_loss: 0.000330
loss: 0.920879, lagrangian_loss: -0.004288, attention_score_distillation_loss: 0.000323
----------------------------------------------------------------------
time: 2023-07-19 20:55:51
Evaluating: accuracy: 0.8948, eval_loss: 0.4981, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4661, expected_sparsity: 0.4508, expected_sequence_sparsity: 0.872, target_sparsity: 0.4459, step: 78000
lambda_1: -3.2839, lambda_2: 450.0362 lambda_3: 0.0000
train remain: [0.96 0.97 0.76 0.71 0.64 0.81 0.82 0.64 0.36]
infer remain: [0.96, 0.96, 0.76, 0.7, 0.64, 0.8, 0.82, 0.64, 0.36]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.92, 0.7, 0.49, 0.31, 0.25, 0.21, 0.13, 0.05]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111110
11111111111111111111111110111111111011011000000000
10111111111111111111111110111111111110000000000000
10111111111111111111111110111111101000000000000000
10111111111111111111111110111111111011110110100000
10111111111111111111111110111111111011111110100000
00101111111111111111101110111111011000001000110000
00001111110110101011011010010000000000100001000000
loss: 0.069720, lagrangian_loss: -0.005718, attention_score_distillation_loss: 0.000312
loss: 0.394201, lagrangian_loss: 0.000193, attention_score_distillation_loss: 0.000298
ETA: 1 day, 5:52:26 | Epoch 6 finished. Took 3340.85 seconds.
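Note on the 0/1 dumps: each 50-character row in these blocks is one pruned layer's keep/drop decision per position bin (bin_num=50), and "infer remain" is simply the fraction of 1s in the corresponding row. A throwaway helper to read them:

def remain_fraction(mask_row: str) -> float:
    """Fraction of the position bins kept by one layer's binary mask row."""
    return mask_row.count("1") / len(mask_row)

# Layer-3 row of the step-78000 block above: 48 of 50 bins kept.
print(remain_fraction("11111111111111111111111111111111111111111111111100"))  # 0.96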
----------------------------------------------------------------------
time: 2023-07-19 21:10:18
Evaluating: accuracy: 0.8954, eval_loss: 0.5641, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4874, expected_sparsity: 0.4685, expected_sequence_sparsity: 0.8761, target_sparsity: 0.463, step: 81000
lambda_1: -1.6612, lambda_2: 467.1519 lambda_3: 0.0000
train remain: [0.97 0.95 0.74 0.69 0.63 0.79 0.74 0.59 0.35]
infer remain: [0.96, 0.94, 0.74, 0.7, 0.62, 0.78, 0.74, 0.58, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.9, 0.67, 0.47, 0.29, 0.23, 0.17, 0.1, 0.03]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101011011000000000
10111111111111111111111110111111111110000000000000
10111111111111111111111110111111001000000000000000
10011111111111111111111110111111111011110110100000
10011101110111111011111110111111111011111110100000
00101111110111111011101010111111011000001000110000
00001111010110101011011010010000000001100000000000
loss: 0.779751, lagrangian_loss: -0.000404, attention_score_distillation_loss: 0.000285
loss: 0.391330, lagrangian_loss: 0.009232, attention_score_distillation_loss: 0.000269
----------------------------------------------------------------------
time: 2023-07-19 21:24:47
Evaluating: accuracy: 0.8982, eval_loss: 0.5389, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4985, expected_sparsity: 0.4826, expected_sequence_sparsity: 0.8794, target_sparsity: 0.4802, step: 84000
lambda_1: -4.5302, lambda_2: 485.2544 lambda_3: 0.0000
train remain: [0.96 0.94 0.72 0.66 0.61 0.78 0.71 0.49 0.35]
infer remain: [0.96, 0.94, 0.72, 0.66, 0.62, 0.76, 0.7, 0.48, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.9, 0.65, 0.43, 0.27, 0.2, 0.14, 0.07, 0.02]
11111111111111111111111111111111111111111111111100
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101011010000000000
10111111111111111111111110111111110100000000000000
10111111111111111111111110111111001000000000000000
10011111111111111111111110111111011011110110100000
10011101110111111011111110111111011011110110100000
00001101110111111011101010011101011000000000110000
00001111010110101011011010010000000000100001000000
loss: 0.184746, lagrangian_loss: 0.000812, attention_score_distillation_loss: 0.000253
loss: 0.169796, lagrangian_loss: -0.004998, attention_score_distillation_loss: 0.000246
----------------------------------------------------------------------
time: 2023-07-19 21:39:16
Evaluating: accuracy: 0.8931, eval_loss: 0.532, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5179, expected_sparsity: 0.5012, expected_sequence_sparsity: 0.8838, target_sparsity: 0.4973, step: 87000
lambda_1: -4.9811, lambda_2: 503.5086 lambda_3: 0.0000
train remain: [0.95 0.94 0.71 0.62 0.61 0.73 0.67 0.45 0.32]
infer remain: [0.94, 0.94, 0.7, 0.62, 0.62, 0.72, 0.66, 0.44, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.62, 0.38, 0.24, 0.17, 0.11, 0.05, 0.02]
11111111111111111111111111111111111111111111111000
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101010010000000000
10111111111111111111111110111111100000000000000000
10111111111111111111111110111110001000000000100000
10011111111111111011111110011111011011110110100000
10011101110111111011111010111101011011110110100000
00001101110111101011001010011101011000000000110000
00001111010110101011011010010000000000100000000000
loss: 0.455696, lagrangian_loss: 0.014450, attention_score_distillation_loss: 0.000231
loss: 0.308696, lagrangian_loss: 0.001392, attention_score_distillation_loss: 0.000217
----------------------------------------------------------------------
time: 2023-07-19 21:53:47
Evaluating: accuracy: 0.8944, eval_loss: 0.4786, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.529, expected_sparsity: 0.5139, expected_sequence_sparsity: 0.8868, target_sparsity: 0.5145, step: 90000
lambda_1: -4.3059, lambda_2: 521.4919 lambda_3: 0.0000
train remain: [0.93 0.93 0.67 0.62 0.6 0.69 0.63 0.41 0.31]
infer remain: [0.92, 0.94, 0.68, 0.62, 0.6, 0.7, 0.62, 0.42, 0.32]
layerwise remain: [1.0, 1.0, 1.0, 0.92, 0.86, 0.59, 0.36, 0.22, 0.15, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111011111000
11111111111111111111111110111111111111111111111100
11111111111111111111111110111111101010000000000000
10111111111111111111111110111110101000000000000000
10111111111111111111111110111110001000000000000000
10011111111111111011111110011101011011110110100000
10001101110111111011111010011101011011110110100000
00001100110111101011001010011100011000000100110000
00000111010110101011011010000000000100100001000000
loss: 0.593622, lagrangian_loss: 0.001316, attention_score_distillation_loss: 0.000204
ETA: 1 day, 5:03:27 | Epoch 7 finished. Took 3338.89 seconds.
loss: 0.123990, lagrangian_loss: 0.002299, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-19 22:08:20
Evaluating: accuracy: 0.8959, eval_loss: 0.5149, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.554, expected_sparsity: 0.5383, expected_sequence_sparsity: 0.8925, target_sparsity: 0.5316, step: 93000
lambda_1: -7.9677, lambda_2: 538.8955 lambda_3: 0.0000
train remain: [0.89 0.91 0.67 0.62 0.6 0.67 0.6 0.37 0.3 ]
infer remain: [0.88, 0.9, 0.66, 0.62, 0.6, 0.66, 0.6, 0.38, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 0.88, 0.79, 0.52, 0.32, 0.19, 0.13, 0.08, 0.03, 0.01]
11111111111111111111111110111111111111111011110000
11111111111111111111111110111111111111111101110100
11111111111111111111111110111111101000000000000000
10111111111111111111111110111110100000000000000100
10111111111111111111111110111100001000000100000000
10001111110111111011111110011101011011110110100000
10001101110111111011011010011101011011110110100000
00001100110110101011001010011100011000000100100000
10000111010110101011011010000000000000100000000000
loss: 0.035226, lagrangian_loss: -0.010783, attention_score_distillation_loss: 0.000178
loss: 0.611543, lagrangian_loss: 0.013423, attention_score_distillation_loss: 0.000163
----------------------------------------------------------------------
time: 2023-07-19 22:22:52
Evaluating: accuracy: 0.8924, eval_loss: 0.4819, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5623, expected_sparsity: 0.5483, expected_sequence_sparsity: 0.8949, target_sparsity: 0.5488, step: 96000
lambda_1: -4.6683, lambda_2: 556.6636 lambda_3: 0.0000
train remain: [0.86 0.9 0.64 0.61 0.6 0.61 0.58 0.35 0.3 ]
infer remain: [0.86, 0.9, 0.64, 0.62, 0.6, 0.62, 0.58, 0.36, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 0.86, 0.77, 0.5, 0.31, 0.18, 0.11, 0.07, 0.02, 0.01]
11111111111111111111111110111111111111111011010000
11111111111111111111111110111111111111111101110100
11111111111111111111111110110111101000000000000000
10111111111111111111111110111110100000000000100000
10111111111111111111111110111100001001000000000000
10001111110111101011111010011101011011110110100000
10001101110111101011011010011101011011110110100000
10001100110110101011001010010100011000000000100000
10000111010010101011011010000001000000100000000000
loss: 0.231981, lagrangian_loss: 0.007939, attention_score_distillation_loss: 0.000153
loss: 0.162509, lagrangian_loss: -0.010745, attention_score_distillation_loss: 0.000140
----------------------------------------------------------------------
time: 2023-07-19 22:37:21
Evaluating: accuracy: 0.8893, eval_loss: 0.5444, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.579, expected_sparsity: 0.5658, expected_sequence_sparsity: 0.899, target_sparsity: 0.5659, step: 99000
lambda_1: -5.4141, lambda_2: 573.7689 lambda_3: 0.0000
train remain: [0.85 0.86 0.63 0.6 0.54 0.6 0.56 0.35 0.27]
infer remain: [0.84, 0.86, 0.64, 0.6, 0.54, 0.6, 0.56, 0.36, 0.26]
layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.72, 0.46, 0.28, 0.15, 0.09, 0.05, 0.02, 0.0]
11111111111111111111111110111111111111110011010000
10111111111111111111011110111111111111111101110100
11111111111111111111111110110111100010000000000000
10111111111111111111111110111110100000000000000000
10011111111111111011111110111100001000000000000000
10001111110111101011011010011101011011110110100000
10001101110110101011011010011101011001110111100000
00001100110110101001001010010100011000010100100000
10000111010010101011011010000000000000000000000000
loss: 0.315823, lagrangian_loss: 0.000172, attention_score_distillation_loss: 0.000128
loss: 0.141669, lagrangian_loss: -0.009282, attention_score_distillation_loss: 0.000114
----------------------------------------------------------------------
time: 2023-07-19 22:51:51
Evaluating: accuracy: 0.8854, eval_loss: 0.5078, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5929, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.9027, target_sparsity: 0.5831, step: 102000
lambda_1: -7.4074, lambda_2: 590.8319 lambda_3: 0.0000
train remain: [0.82 0.83 0.6 0.6 0.5 0.58 0.56 0.34 0.23]
infer remain: [0.82, 0.84, 0.6, 0.6, 0.5, 0.58, 0.56, 0.34, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.69, 0.41, 0.25, 0.12, 0.07, 0.04, 0.01, 0.0]
11111111111111111111111110111111111111110001010000
10111111111111111111011110111111111110111101110100
10111111111111111111111110110111100000000000000000
10111111111111111111111110111110001000000000000000
10011111111111111011111010011100001000000000000000
10001011110111101011011010011101011011110110100000
10001101110110101011011010011101011001110111100000
00001100110110101001001010010100011001000000100000
10000011010010101011010010000000000001000000000000
loss: 0.289435, lagrangian_loss: 0.002069, attention_score_distillation_loss: 0.000101
ETA: 1 day, 4:13:19 | Epoch 8 finished. Took 3345.01 seconds.
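Aside on the ETA lines: they are consistent with multiplying the mean epoch duration so far by the number of epochs remaining (num_train_epochs=40). A quick check against the nine epoch times logged up to this point:

from datetime import timedelta

epoch_secs = [3104.41, 3310.24, 3322.07, 3300.53, 3097.1,
              3337.65, 3340.85, 3338.89, 3345.01]   # epochs 0..8 above
remaining = 40 - len(epoch_secs)
print(timedelta(seconds=round(sum(epoch_secs) / len(epoch_secs) * remaining)))
# 1 day, 4:13:20 -- matching "ETA: 1 day, 4:13:19" up to rounding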
loss: 0.402243, lagrangian_loss: -0.007613, attention_score_distillation_loss: 0.000089
----------------------------------------------------------------------
time: 2023-07-19 23:06:16
Evaluating: accuracy: 0.8868, eval_loss: 0.5376, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.615, expected_sparsity: 0.6017, expected_sequence_sparsity: 0.9074, target_sparsity: 0.6002, step: 105000
lambda_1: -8.2756, lambda_2: 607.7910 lambda_3: 0.0000
train remain: [0.81 0.78 0.56 0.6 0.46 0.54 0.54 0.3 0.22]
infer remain: [0.8, 0.78, 0.56, 0.6, 0.46, 0.54, 0.54, 0.3, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.62, 0.35, 0.21, 0.1, 0.05, 0.03, 0.01, 0.0]
10111111111111111111111110111111111111110001010000
10111111110111111111011110111111111110010101110100
10111111111111111111111110110110000000000000000000
10111111111111111111111110111111000000000000000000
10011111110111101011011010011100001001000000000000
10000011110110101011011010011101011011110110100000
10001101110110101011011010011101010001110110100001
00000100110110101001001010010100011000000000100000
00000011010010101011010010000000010000000000000000
loss: 0.066504, lagrangian_loss: 0.006595, attention_score_distillation_loss: 0.000074
loss: 0.568262, lagrangian_loss: -0.009137, attention_score_distillation_loss: 0.000063
----------------------------------------------------------------------
time: 2023-07-19 23:20:42
Evaluating: accuracy: 0.8866, eval_loss: 0.5264, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6289, expected_sparsity: 0.6162, expected_sequence_sparsity: 0.9109, target_sparsity: 0.6173, step: 108000
lambda_1: -7.4469, lambda_2: 626.0708 lambda_3: 0.0000
train remain: [0.78 0.74 0.53 0.58 0.44 0.48 0.45 0.24 0.18]
infer remain: [0.78, 0.74, 0.54, 0.58, 0.44, 0.48, 0.44, 0.24, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.58, 0.31, 0.18, 0.08, 0.04, 0.02, 0.0, 0.0]
10111111111111111111111110111111111111110001000000
10111111110111111111011110111101111110010100110100
10111111111111111111111110110100000000000000000000
10111111111111111111111110111110000000000000000000
10011111110111101011011010011100001000000000000000
10000011110110101011011010011101010001010110100000
10000001110110101011010010011101010001010110000001
10000000110010101001001010000100010001000000000000
10000010010010001001000010000000010000000000000000
loss: 0.894482, lagrangian_loss: 0.011798, attention_score_distillation_loss: 0.000049
loss: 0.571284, lagrangian_loss: 0.014897, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 23:35:04
Evaluating: accuracy: 0.8851, eval_loss: 0.5031, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6483, expected_sparsity: 0.6345, expected_sequence_sparsity: 0.9152, target_sparsity: 0.6345, step: 111000
lambda_1: -11.8927, lambda_2: 643.7094 lambda_3: 0.0000
train remain: [0.75 0.71 0.51 0.5 0.41 0.44 0.41 0.22 0.3 ]
infer remain: [0.74, 0.72, 0.5, 0.5, 0.4, 0.44, 0.4, 0.22, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.53, 0.27, 0.13, 0.05, 0.02, 0.01, 0.0, 0.0]
00111111111111111111111110111111111111110000000000
10111111110011111111011110111101111110010100110100
10111111111111111111101110100100000000000000000000
10011111111111101011111110111100000000000000000000
10001111110110101011011010011100001000000000000000
10000011110010101011011010011101010001010110000000
10000001110010101011010010001101010001010110000001
10000000110010101001000010000100010001000000000000
10000010010010001001000010000000010001000000000000
loss: 0.458527, lagrangian_loss: -0.009432, attention_score_distillation_loss: 0.000024
loss: 0.104331, lagrangian_loss: -0.014244, attention_score_distillation_loss: 0.000011
ETA: 1 day, 3:10:17 | Epoch 9 finished. Took 3108.96 seconds.
Starting saving the best from epoch 10 and step 114000
----------------------------------------------------------------------
time: 2023-07-19 23:49:26
Evaluating: accuracy: 0.8817, eval_loss: 0.5588, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6502, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 114000
lambda_1: -9.3703, lambda_2: 665.1255 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.39 0.37 0.37 0.23 0.45]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.36, 0.36, 0.22, 0.16]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10101111110011111111010110111101111010010100110100
10111111111111111011101110100000000000000000000000
10001111110111101011111110011100000000000000000000
10000111110110101011011010011100000000000100000000
10000001110010101001010010001101010001010110000000
10000001110010101001010010001100010001010110000001
10000000110010101001000010000100010001000000000000
10000000010010001001000010000000010001000000000000
Saving the best model so far: [Epoch 10 | Step: 114000 | MACs sparsity: 0.6622 | Score: 0.8817 | Loss: 0.5588]
loss: 0.636109, lagrangian_loss: -0.008807, attention_score_distillation_loss: 0.000010
loss: 0.246043, lagrangian_loss: 0.000224, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:04:02
Evaluating: accuracy: 0.8825, eval_loss: 0.5941, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.65, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 117000
lambda_1: -2.0260, lambda_2: 680.1785 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.38 0.35 0.37 0.22 0.76]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.36, 0.36, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10101111110011111111010110111101111010110100100100
10111111111111111011100110100000000000000000001000
10001111110111101011011110011100000001000000000000
10000111110110101011011010011100000000000100000000
10000011110010101001010010001101010001010100000000
10000001110010101001010010001100010001010110000001
10000000110010101001000010000100010001000000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8817 @ step 114000 epoch 10.03
Saving the best model so far: [Epoch 10 | Step: 117000 | MACs sparsity: 0.6622 | Score: 0.8825 | Loss: 0.5941]
loss: 0.222651, lagrangian_loss: 0.012709, attention_score_distillation_loss: 0.000010
loss: 0.130893, lagrangian_loss: 0.000657, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:19:01
Evaluating: accuracy: 0.8886, eval_loss: 0.5467, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 120000
lambda_1: -3.4723, lambda_2: 696.2979 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.44 0.37 0.33 0.31 0.22 0.84]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.38, 0.32, 0.3, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110100000000000000000000000
10001111110111101011011110011100000001000000000000
10000111110010101011011010001100000000010010000000
10000001110010101001010010000100010001010100000001
10000000110010101001010010000100010001010100000001
10000000110010001001000010000100010001010000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8825 @ step 117000 epoch 10.29
Saving the best model so far: [Epoch 10 | Step: 120000 | MACs sparsity: 0.6622 | Score: 0.8886 | Loss: 0.5467]
loss: 0.149866, lagrangian_loss: -0.003486, attention_score_distillation_loss: 0.000010
loss: 0.057800, lagrangian_loss: -0.000276, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 00:33:44
Evaluating: accuracy: 0.8865, eval_loss: 0.4985, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6505, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 123000
lambda_1: -1.6749, lambda_2: 713.5250 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.36 0.31 0.29 0.23 0.93]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.32, 0.3, 0.22, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
11101111110011111111010110111101111010010100100100
10111111111111111011100110100100000000000000000000
10001111110111101011011010011101000001000000000000
10000111110110101011011010001100000000010000000000
10000001110010101001010010000100010001010100000001
10000000110010101001000010000100010001010100000011
10000000110010001001000010000100010001010000000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
loss: 0.205889, lagrangian_loss: 0.005737, attention_score_distillation_loss: 0.000010
loss: 0.477401, lagrangian_loss: 0.000599, attention_score_distillation_loss: 0.000010
ETA: 1 day, 2:21:12 | Epoch 10 finished. Took 3380.22 seconds.
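Note on checkpointing: selection of the best model only begins once the warmup reaches the full 0.65 target ("Starting saving the best from epoch 10 and step 114000" above); after that, any evaluation that beats the best score so far is written out, otherwise only the current best is reported. A sketch of that bookkeeping (names are illustrative, not the repo's API):

class BestModelTracker:
    def __init__(self, save_start_step: int = 114000):
        self.save_start_step = save_start_step
        self.best_score, self.best_step = None, None

    def update(self, step: int, score: float) -> bool:
        """Return True when the checkpoint at `step` should be saved."""
        if step < self.save_start_step:
            return False
        if self.best_score is None or score > self.best_score:
            self.best_score, self.best_step = score, step
            return True
        return False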
----------------------------------------------------------------------
time: 2023-07-20 00:48:01
Evaluating: accuracy: 0.8843, eval_loss: 0.5702, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 126000
lambda_1: -1.5006, lambda_2: 730.7234 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.44 0.34 0.3 0.28 0.24 0.95]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111110110000000000000000000000000
10001111110111101011011010011101000001000000000000
10000111110010101011011010001100000000010000000000
10000001110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
loss: 0.278874, lagrangian_loss: -0.000694, attention_score_distillation_loss: 0.000010
loss: 0.249959, lagrangian_loss: 0.008280, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:02:23
Evaluating: accuracy: 0.8899, eval_loss: 0.5189, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 129000
lambda_1: -1.2810, lambda_2: 747.8193 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.45 0.34 0.3 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110100000000000000000000000
10001111110111101011011010011100000101000000000000
10000111110010101011010010001100000000010010000000
10000001110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8886 @ step 120000 epoch 10.55
Saving the best model so far: [Epoch 11 | Step: 129000 | MACs sparsity: 0.6649 | Score: 0.8899 | Loss: 0.5189]
loss: 0.531214, lagrangian_loss: -0.000237, attention_score_distillation_loss: 0.000010
loss: 0.436766, lagrangian_loss: 0.002168, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:16:59
Evaluating: accuracy: 0.8302, eval_loss: 0.7375, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6509, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 132000
lambda_1: -1.7529, lambda_2: 764.9451 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.45 0.34 0.29 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.34, 0.3, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
00111111111111111111111110111111111110110000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011101000001000000000000
10000111110010101011010010001100000000010100000000 10000000110010101001000010001100010001010110000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010000000001 11111111111111111111111111111111111111111111111111 Best eval score so far: 0.8899 @ step 129000 epoch 11.34 loss: 0.528980, lagrangian_loss: -0.000636, attention_score_distillation_loss: 0.000010 loss: 0.455666, lagrangian_loss: 0.015322, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 01:31:19 Evaluating: accuracy: 0.8904, eval_loss: 0.495, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6506, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 135000 lambda_1: -1.3714, lambda_2: 782.2104 lambda_3: 0.0000 train remain: [0.72 0.66 0.47 0.46 0.36 0.29 0.28 0.25 0.96] infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.3, 0.28, 0.24, 1.0] layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0] 00111111111111111111111110111111111110100000010000 10101111110011111111010110111101111010010110100100 10111111111111111111100110010000000000000000000000 10001111110111101011011010011100010001000000000000 10000111110010101011010010001100010001010000000000 10000000110010101001000010000101010001010110000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000000 11111111111111111111111111111111111111111111111111 Best eval score so far: 0.8899 @ step 129000 epoch 11.34 Saving the best model so far: [Epoch 11 | Step: 135000 | MACs sparsity: 0.6649 | Score: 0.8904 | Loss: 0.495] loss: 0.242830, lagrangian_loss: 0.000136, attention_score_distillation_loss: 0.000010 ETA: 1 day, 1:29:56 | Epoch 11 finished. Took 3355.22 seconds. 
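The tiny positive and negative lagrangian_loss values in these records match what a two-multiplier Lagrangian relaxation of the sparsity constraint would produce. A sketch assuming the CoFi-style form lambda_1 * (s - t) + lambda_2 * (s - t)^2, where s is the expected sparsity, t the target, and the multipliers are the logged lambda_1/lambda_2 (trained adversarially at reg_learning_rate 0.01):

```python
def lagrangian_loss(expected_sparsity: float, target_sparsity: float,
                    lambda_1: float, lambda_2: float) -> float:
    # Assumed CoFi-style penalty: linear plus quadratic term in the gap
    # between expected and target sparsity. Because the multipliers are
    # trained adversarially, the term can go slightly negative when the
    # gap and lambda_1 have opposite signs, as seen in the log.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# Plugging in the step-135000 record (expected 0.6506, target 0.65) yields a
# value of the same tiny magnitude as the logged lagrangian_loss figures:
print(lagrangian_loss(0.6506, 0.65, -1.3714, 782.2104))
```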
loss: 0.639652, lagrangian_loss: 0.003957, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 01:46:13
Evaluating: accuracy: 0.8949, eval_loss: 0.5042, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6507, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 138000
lambda_1: -2.2353, lambda_2: 799.8128 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.46 0.37 0.29 0.28 0.25 0.96]
infer remain: [0.72, 0.66, 0.46, 0.44, 0.36, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011100010001000000000000
10000111110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8904 @ step 135000 epoch 11.87
Saving the best model so far: [Epoch 12 | Step: 138000 | MACs sparsity: 0.6649 | Score: 0.8949 | Loss: 0.5042]
loss: 0.164816, lagrangian_loss: -0.000475, attention_score_distillation_loss: 0.000010
loss: 0.077830, lagrangian_loss: -0.001258, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:01:08
Evaluating: accuracy: 0.8944, eval_loss: 0.5021, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6622, expected_sparsity: 0.6502, expected_sequence_sparsity: 0.9188, target_sparsity: 0.65, step: 141000
lambda_1: -1.8924, lambda_2: 816.6750 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.46 0.36 0.29 0.28 0.27 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.36, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.04, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111111100110010000000000000000000000
10001111110111101011011010011100010101000000000000
10000111110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.391748, lagrangian_loss: 0.001875, attention_score_distillation_loss: 0.000010
loss: 0.035931, lagrangian_loss: -0.001861, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:15:27
Evaluating: accuracy: 0.8911, eval_loss: 0.5233, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 144000
lambda_1: -2.2427, lambda_2: 833.6587 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.48 0.35 0.28 0.28 0.29 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110100000000000000000000000
10001111110111101011011010011100010101000000000000
10000111110010101011010010001100010000010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.833562, lagrangian_loss: 0.003554, attention_score_distillation_loss: 0.000010
loss: 0.029055, lagrangian_loss: -0.000370, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:29:49
Evaluating: accuracy: 0.8857, eval_loss: 0.5672, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 147000
lambda_1: -1.6238, lambda_2: 850.7535 lambda_3: 0.0000
train remain: [0.72 0.66 0.47 0.47 0.35 0.29 0.28 0.39 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110101000000000000000000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000001110010101001000010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.978059, lagrangian_loss: 0.005119, attention_score_distillation_loss: 0.000010
ETA: 1 day, 0:37:22 | Epoch 12 finished. Took 3338.46 seconds.
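For context, the 0/1 rows are driven by L0 hard-concrete gates over token bins (the run logs temperature 0.67, init drop rate 0.01, and token_loga of shape [9, 50]). A minimal sketch of hard-concrete sampling under those hyper-parameters, assuming the standard (-0.1, 1.1) stretch from Louizos et al.:

```python
import math
import torch

temperature = 2.0 / 3.0       # logged as temperature: 0.67
droprate_init = 0.01          # logged init drop rate
limit_l, limit_r = -0.1, 1.1  # assumed standard stretch interval

# token_loga shape [9, 50]: 9 pruned layers x 50 token bins; with
# droprate_init = 0.01 the log-alpha starts strongly in favor of keeping.
loga = torch.full((9, 50), math.log(1 - droprate_init) - math.log(droprate_init))

def sample_gates(loga: torch.Tensor) -> torch.Tensor:
    """Sample stretched hard-concrete gates in [0, 1] (training-time)."""
    u = torch.rand_like(loga).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((torch.log(u) - torch.log1p(-u) + loga) / temperature)
    return (s * (limit_r - limit_l) + limit_l).clamp(0.0, 1.0)

print(sample_gates(loga).mean())  # starts near 1.0: almost all bins kept
```

At inference the gates are thresholded to the hard 0/1 decisions printed in the mask rows.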
loss: 0.472538, lagrangian_loss: 0.010825, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:44:10
Evaluating: accuracy: 0.8881, eval_loss: 0.5683, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 150000
lambda_1: -1.7713, lambda_2: 866.8922 lambda_3: 0.0000
train remain: [0.72 0.66 0.47 0.47 0.36 0.29 0.28 0.43 0.95]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110000001100000000000000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.023684, lagrangian_loss: -0.000878, attention_score_distillation_loss: 0.000010
loss: 0.013496, lagrangian_loss: 0.000733, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 02:58:30
Evaluating: accuracy: 0.8935, eval_loss: 0.4978, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 153000
lambda_1: -1.5745, lambda_2: 883.7991 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.36 0.29 0.28 0.35 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000000010000000
10001111110111101011011010011100010001000010000000
10000111110010101011010010001100010000010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.093153, lagrangian_loss: 0.000674, attention_score_distillation_loss: 0.000010
loss: 0.120158, lagrangian_loss: -0.000088, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:13:04
Evaluating: accuracy: 0.8931, eval_loss: 0.5467, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 156000
lambda_1: -1.8309, lambda_2: 900.7239 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.35 0.29 0.28 0.31 0.96]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100110100
10111111111111111011100110000000000100000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001010010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
loss: 0.427451, lagrangian_loss: 0.002730, attention_score_distillation_loss: 0.000010
loss: 0.032086, lagrangian_loss: 0.001369, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:27:27
Evaluating: accuracy: 0.8967, eval_loss: 0.5235, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 159000
lambda_1: -1.9873, lambda_2: 918.0266 lambda_3: 0.0000
train remain: [0.72 0.66 0.46 0.47 0.36 0.29 0.28 0.31 0.96]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
01111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110010000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8949 @ step 138000 epoch 12.14
Saving the best model so far: [Epoch 13 | Step: 159000 | MACs sparsity: 0.6649 | Score: 0.8967 | Loss: 0.5235]
loss: 0.513739, lagrangian_loss: 0.000643, attention_score_distillation_loss: 0.000010
ETA: 23:44:43 | Epoch 13 finished. Took 3349.85 seconds.
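The "Best eval score so far" / "Saving the best model so far" pairs follow a plain keep-the-best rule: report the previous best, then save whenever the new score beats it. A hypothetical sketch of that bookkeeping (names invented, not the repo's actual code):

```python
class BestScoreTracker:
    """Sketch of the checkpoint bookkeeping behind the
    'Best eval score so far' / 'Saving the best model so far' lines."""

    def __init__(self):
        self.best = None  # (score, step, epoch) of the best eval so far

    def update(self, score, step, epoch, macs_sparsity, loss, save_fn):
        # The log prints the previous best first, then saves on improvement.
        if self.best is not None:
            print(f"Best eval score so far: {self.best[0]:.4f} "
                  f"@ step {self.best[1]} epoch {self.best[2]:.2f}")
        if self.best is None or score > self.best[0]:
            self.best = (score, step, epoch)
            save_fn()  # e.g. write the model to the output directory
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")
```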
loss: 0.512911, lagrangian_loss: 0.008754, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:42:20
Evaluating: accuracy: 0.8982, eval_loss: 0.5212, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 162000
lambda_1: -1.8915, lambda_2: 935.5117 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.46 0.36 0.29 0.28 0.35 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.24, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110010000000000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8967 @ step 159000 epoch 13.98
Saving the best model so far: [Epoch 14 | Step: 162000 | MACs sparsity: 0.6649 | Score: 0.8982 | Loss: 0.5212]
loss: 0.050625, lagrangian_loss: -0.000414, attention_score_distillation_loss: 0.000010
loss: 0.146055, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 03:57:24
Evaluating: accuracy: 0.8962, eval_loss: 0.5215, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6504, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 165000
lambda_1: -1.1926, lambda_2: 953.1530 lambda_3: 0.0000
train remain: [0.72 0.66 0.45 0.46 0.35 0.28 0.29 0.4 0.94]
infer remain: [0.72, 0.66, 0.46, 0.46, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.22, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110010000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000001110010101001000010000100010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8982 @ step 162000 epoch 14.25
loss: 0.296017, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000010
loss: 0.034512, lagrangian_loss: 0.001690, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:11:51
Evaluating: accuracy: 0.8993, eval_loss: 0.518, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6517, expected_sequence_sparsity: 0.9192, target_sparsity: 0.65, step: 168000
lambda_1: -1.6172, lambda_2: 969.9517 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.29 0.46 0.95]
infer remain: [0.72, 0.66, 0.44, 0.46, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.1, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010110100100
10111111111111111011100110000000010000000000000000
10001111110111101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8982 @ step 162000 epoch 14.25
Saving the best model so far: [Epoch 14 | Step: 168000 | MACs sparsity: 0.6649 | Score: 0.8993 | Loss: 0.518]
loss: 0.390874, lagrangian_loss: -0.000075, attention_score_distillation_loss: 0.000010
loss: 0.030909, lagrangian_loss: 0.000422, attention_score_distillation_loss: 0.000010
ETA: 22:48:01 | Epoch 14 finished. Took 3219.6 seconds.
----------------------------------------------------------------------
time: 2023-07-20 04:27:23
Evaluating: accuracy: 0.8982, eval_loss: 0.5322, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 171000
lambda_1: -2.3461, lambda_2: 987.2769 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.29 0.41 0.96]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110010000000000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001101010000010000000000
10000000110010101001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8993 @ step 168000 epoch 14.77
loss: 0.539738, lagrangian_loss: -0.001098, attention_score_distillation_loss: 0.000010
loss: 0.386357, lagrangian_loss: -0.000066, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:41:51
Evaluating: accuracy: 0.8994, eval_loss: 0.544, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 174000
lambda_1: -1.4821, lambda_2: 1004.1932 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.29 0.28 0.42 0.96]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000001000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001100010000010100000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8993 @ step 168000 epoch 14.77
Saving the best model so far: [Epoch 15 | Step: 174000 | MACs sparsity: 0.6649 | Score: 0.8994 | Loss: 0.544]
loss: 0.404674, lagrangian_loss: -0.000497, attention_score_distillation_loss: 0.000010
loss: 0.028135, lagrangian_loss: 0.000778, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 04:57:06
Evaluating: accuracy: 0.8991, eval_loss: 0.5316, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 177000
lambda_1: -2.1925, lambda_2: 1022.0064 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.28 0.39 0.95]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000000000001000
10001111110110101011011010011100011001000000000000
10000011110010101011010010001100010000010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.224978, lagrangian_loss: -0.000510, attention_score_distillation_loss: 0.000010
loss: 0.031553, lagrangian_loss: 0.004469, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:11:35
Evaluating: accuracy: 0.8966, eval_loss: 0.5306, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 180000
lambda_1: -1.2492, lambda_2: 1039.1138 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.35 0.28 0.28 0.37 0.95]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011100010001000010000000
10000011110010101011010010001100010000011000000000
10000000110010001001000010001100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.048526, lagrangian_loss: -0.000178, attention_score_distillation_loss: 0.000010
loss: 0.336060, lagrangian_loss: 0.002401, attention_score_distillation_loss: 0.000010
ETA: 21:55:47 | Epoch 15 finished. Took 3382.28 seconds.
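The ETA line drops by roughly one epoch's wall time per epoch (each epoch takes about 3,100-3,500 seconds here). One plausible reconstruction, assuming ETA = remaining epochs x a per-epoch duration estimate; the actual estimator is not shown in the log, so the value below only approximates the logged 21:55:47:

```python
import datetime

def eta_string(epochs_done: int, total_epochs: int, sec_per_epoch: float) -> str:
    # Assumed estimator: remaining epochs times a (possibly smoothed)
    # per-epoch duration, rendered the way timedelta prints, which matches
    # the log's "ETA: 21:55:47" and "ETA: 1 day, 2:21:12" formats.
    remaining = total_epochs - epochs_done
    return str(datetime.timedelta(seconds=int(remaining * sec_per_epoch)))

# After epoch 15 of 40 at ~3382 s/epoch -> "22:32:54", near the logged value.
print(f"ETA: {eta_string(16, 40, 3382.28)} | Epoch 15 finished.")
```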
----------------------------------------------------------------------
time: 2023-07-20 05:26:02
Evaluating: accuracy: 0.8983, eval_loss: 0.5317, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 183000
lambda_1: -6.2507, lambda_2: 1056.3933 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.41 0.72]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000010000000000
10001111110110101011011010011100010001100000000000
10000011110010101011010010001100010000010010000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.282242, lagrangian_loss: -0.003303, attention_score_distillation_loss: 0.000010
loss: 0.030904, lagrangian_loss: -0.003420, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:40:30
Evaluating: accuracy: 0.898, eval_loss: 0.5156, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6522, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 186000
lambda_1: -3.1781, lambda_2: 1073.7429 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.36 0.61]
infer remain: [0.72, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.48, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010101100100
10111111111111111011100110000000000001000000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010000010100000000
10000010110010001001000010000101010001010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.012500, lagrangian_loss: 0.008274, attention_score_distillation_loss: 0.000010
loss: 0.026993, lagrangian_loss: 0.001338, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 05:54:58
Evaluating: accuracy: 0.8981, eval_loss: 0.5547, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6594, expected_sparsity: 0.6487, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 189000
lambda_1: -1.8263, lambda_2: 1090.6847 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.34 0.28 0.28 0.39 0.73]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000010000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010101001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.015875, lagrangian_loss: 0.002947, attention_score_distillation_loss: 0.000010
loss: 0.019508, lagrangian_loss: -0.001077, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:09:28
Evaluating: accuracy: 0.8984, eval_loss: 0.5226, token_prune_loc: [True, True, True, True, True, True, True, True, False], macs_sparsity: 0.6594, expected_sparsity: 0.6487, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 192000
lambda_1: -1.8976, lambda_2: 1107.8623 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.32 0.63]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.26, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000000100
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
11111111111111111111111111111111111111111111111111
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.323650, lagrangian_loss: 0.006087, attention_score_distillation_loss: 0.000010
ETA: 21:01:50 | Epoch 16 finished. Took 3328.64 seconds.
loss: 0.019672, lagrangian_loss: 0.001162, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:23:42
Evaluating: accuracy: 0.8984, eval_loss: 0.5406, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6488, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 195000
lambda_1: -2.2479, lambda_2: 1124.6880 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.27 0.29 0.44]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000010000000000
10001111110110101011011010011100010001000001000000
10000011110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.380577, lagrangian_loss: 0.002817, attention_score_distillation_loss: 0.000010
loss: 0.834360, lagrangian_loss: -0.001071, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:38:02
Evaluating: accuracy: 0.8987, eval_loss: 0.5499, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6488, expected_sequence_sparsity: 0.9185, target_sparsity: 0.65, step: 198000
lambda_1: -1.3839, lambda_2: 1142.3954 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.29 0.53]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.34, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111111111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000001000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010100010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
loss: 0.017833, lagrangian_loss: 0.009395, attention_score_distillation_loss: 0.000010
loss: 0.299865, lagrangian_loss: 0.005452, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 06:52:23
Evaluating: accuracy: 0.8995, eval_loss: 0.5661, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 201000
lambda_1: -2.4810, lambda_2: 1159.6271 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.31 0.42]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111111111010010100100100
10111111111111111011110110000000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010000010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8994 @ step 174000 epoch 15.30
Saving the best model so far: [Epoch 17 | Step: 201000 | MACs sparsity: 0.6594 | Score: 0.8995 | Loss: 0.5661]
loss: 0.365697, lagrangian_loss: 0.001115, attention_score_distillation_loss: 0.000010
loss: 0.017773, lagrangian_loss: 0.001056, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:07:14
Evaluating: accuracy: 0.8999, eval_loss: 0.543, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 204000
lambda_1: -1.5292, lambda_2: 1176.3229 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.45 0.33 0.28 0.28 0.32 0.42]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000100
11101111110011111111010110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010000010100000000
10000010110010001001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.8995 @ step 201000 epoch 17.68
Saving the best model so far: [Epoch 17 | Step: 204000 | MACs sparsity: 0.6594 | Score: 0.8999 | Loss: 0.543]
loss: 0.493361, lagrangian_loss: 0.008221, attention_score_distillation_loss: 0.000010
ETA: 20:08:46 | Epoch 17 finished. Took 3379.56 seconds.
loss: 0.029668, lagrangian_loss: 0.001050, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:22:15
Evaluating: accuracy: 0.9009, eval_loss: 0.548, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 207000
lambda_1: -1.3423, lambda_2: 1193.8641 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.33 0.28 0.28 0.36 0.46]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.26, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000010000
10101111110011111111010110111101111011010100100100
10111111111111111111100110000000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010000010100000000
10000001110010001001000010000100010101010100000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.8999 @ step 204000 epoch 17.94
Saving the best model so far: [Epoch 18 | Step: 207000 | MACs sparsity: 0.6594 | Score: 0.9009 | Loss: 0.548]
loss: 0.022745, lagrangian_loss: 0.000888, attention_score_distillation_loss: 0.000010
loss: 0.026422, lagrangian_loss: 0.004357, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:38:45
Evaluating: accuracy: 0.9012, eval_loss: 0.5363, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 210000
lambda_1: -0.9608, lambda_2: 1211.0245 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.38 0.48]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110101000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110001000000000000000000000
10001111110110101011011010011101010001000000000000
10000001110110101011010010001100010000010000000000
10000001110010001001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9009 @ step 207000 epoch 18.20
Saving the best model so far: [Epoch 18 | Step: 210000 | MACs sparsity: 0.6594 | Score: 0.9012 | Loss: 0.5363]
loss: 0.459285, lagrangian_loss: 0.000205, attention_score_distillation_loss: 0.000010
loss: 0.036423, lagrangian_loss: 0.000209, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 07:53:18
Evaluating: accuracy: 0.902, eval_loss: 0.5128, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 213000
lambda_1: -1.1285, lambda_2: 1227.9618 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.43 0.53]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110100000000000000000000000
10001111110110101011011010011100010001000010000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9012 @ step 210000 epoch 18.47
Saving the best model so far: [Epoch 18 | Step: 213000 | MACs sparsity: 0.6594 | Score: 0.902 | Loss: 0.5128]
loss: 0.173927, lagrangian_loss: -0.000254, attention_score_distillation_loss: 0.000010
loss: 0.348842, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:08:18
Evaluating: accuracy: 0.9013, eval_loss: 0.5322, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 216000
lambda_1: -1.2842, lambda_2: 1245.1265 lambda_3: 0.0000
train remain: [0.73 0.65 0.45 0.44 0.32 0.28 0.28 0.42 0.45]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9020 @ step 213000 epoch 18.73
loss: 0.034287, lagrangian_loss: 0.002586, attention_score_distillation_loss: 0.000010
ETA: 19:17:13 | Epoch 18 finished. Took 3480.99 seconds.
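Reading the mask blocks: each 0/1 row is a hard keep/drop decision over the bin_num=50 token bins of one pruned layer, and "infer remain" is simply the kept fraction per row (the all-ones rows correspond to layers whose token_prune_loc entry is still False). A small illustrative helper:

```python
def infer_remain_from_masks(mask_rows):
    # Each printed row is a 50-character 0/1 string: the hard keep/drop
    # decision per token bin at one pruned layer. "infer remain" is the
    # kept fraction per row, rounded as in the log.
    return [round(row.count("1") / len(row), 2) for row in mask_rows]

rows = [
    "1" * 50,              # an unpruned layer prints all ones -> 1.0
    "1" * 12 + "0" * 38,   # hypothetical row with 12 kept bins -> 0.24
]
print(infer_remain_from_masks(rows))  # [1.0, 0.24]
```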
loss: 0.124186, lagrangian_loss: 0.012397, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:22:30
Evaluating: accuracy: 0.903, eval_loss: 0.5544, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 219000
lambda_1: -1.7371, lambda_2: 1261.9095 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.45 0.43]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000000100000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010101001000010000100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9020 @ step 213000 epoch 18.73
Saving the best model so far: [Epoch 19 | Step: 219000 | MACs sparsity: 0.6594 | Score: 0.903 | Loss: 0.5544]
loss: 0.016656, lagrangian_loss: 0.006842, attention_score_distillation_loss: 0.000010
loss: 0.191987, lagrangian_loss: 0.001296, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:36:59
Evaluating: accuracy: 0.9022, eval_loss: 0.5016, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 222000
lambda_1: -1.1584, lambda_2: 1279.1924 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.51 0.33]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000100000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.053421, lagrangian_loss: 0.000117, attention_score_distillation_loss: 0.000010
loss: 0.015838, lagrangian_loss: 0.002198, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 08:51:08
Evaluating: accuracy: 0.9007, eval_loss: 0.5236, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 225000
lambda_1: -1.1455, lambda_2: 1296.5518 lambda_3: 0.0000
train remain: [0.73 0.66 0.45 0.44 0.32 0.28 0.27 0.61 0.3 ]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111111011111111010110111101111010010100100100
10111111111111111011100110000100000000000000000000
10001111110110101011011010011101010001000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.015111, lagrangian_loss: 0.001796, attention_score_distillation_loss: 0.000010
loss: 0.012449, lagrangian_loss: -0.000120, attention_score_distillation_loss: 0.000010
ETA: 18:18:23 | Epoch 19 finished. Took 3082.5 seconds.
----------------------------------------------------------------------
time: 2023-07-20 09:05:26
Evaluating: accuracy: 0.9014, eval_loss: 0.5296, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 228000
lambda_1: -2.2491, lambda_2: 1313.6938 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.28 0.49 0.3 ]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111011010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101011010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010100000000
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.009079, lagrangian_loss: -0.000901, attention_score_distillation_loss: 0.000010
loss: 0.015812, lagrangian_loss: 0.008045, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:19:39
Evaluating: accuracy: 0.9026, eval_loss: 0.5065, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 231000
lambda_1: -0.7011, lambda_2: 1331.4109 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.27 0.4 0.27]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111111111010010100100100
10111111111111111011110110000000000000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010001010101000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.565751, lagrangian_loss: -0.000060, attention_score_distillation_loss: 0.000010
loss: 0.011595, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:33:56
Evaluating: accuracy: 0.9017, eval_loss: 0.5104, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 234000
lambda_1: -1.2062, lambda_2: 1348.4353 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.27 0.52 0.28]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000100000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.083085, lagrangian_loss: -0.000266, attention_score_distillation_loss: 0.000010
loss: 0.012467, lagrangian_loss: 0.006103, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 09:48:13
Evaluating: accuracy: 0.9017, eval_loss: 0.526, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 237000
lambda_1: -1.0935, lambda_2: 1365.8163 lambda_3: 0.0000
train remain: [0.73 0.66 0.44 0.43 0.32 0.28 0.27 0.52 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100100000000
10101111110011111111010110111101111010010100100110
10111111111111111011110110000000000000000000000000
10000111110110101011011010011100010001010000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.017863, lagrangian_loss: 0.005429, attention_score_distillation_loss: 0.000010
loss: 0.011142, lagrangian_loss: 0.000162, attention_score_distillation_loss: 0.000010
ETA: 17:23:18 | Epoch 20 finished. Took 3285.07 seconds.
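A convenience sketch for anyone post-processing a log like this: the eval records are regular enough to scrape (step, accuracy) pairs, e.g. to plot the climb from 0.8865 at step 123000 toward the 0.90+ scores in this stretch. Only the record layout shown above is assumed:

```python
import re

# Pull (step, accuracy) pairs out of the raw log text. re.S lets the
# non-greedy gap span the token_prune_loc / sparsity fields between the
# accuracy value and the step number within one record.
EVAL = re.compile(r"Evaluating: accuracy: ([0-9.]+),.*?step: (\d+)", re.S)

def eval_points(log_text: str):
    return [(int(step), float(acc)) for acc, step in EVAL.findall(log_text)]

# Usage: eval_points(open("train.log").read()) -> [(123000, 0.8865), ...]
```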
----------------------------------------------------------------------
time: 2023-07-20 10:02:25
Evaluating: accuracy: 0.9028, eval_loss: 0.5382, token_prune_loc: [True, True, True, True, True, True, True, False, True], macs_sparsity: 0.6594, expected_sparsity: 0.6493, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 240000
lambda_1: -0.9370, lambda_2: 1383.4458 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.63 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 1.0, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100010000000
10101111110011111111011110111101111010010100100100
10111111111111111011110110000000000000000000000000
10001111110110101011011010011100010001000000000000
10000011110010101001010010001100010001010000000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
11111111111111111111111111111111111111111111111111
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
loss: 0.007702, lagrangian_loss: 0.008356, attention_score_distillation_loss: 0.000010
loss: 0.266675, lagrangian_loss: 0.003927, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:16:39
Evaluating: accuracy: 0.9044, eval_loss: 0.5303, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 243000
lambda_1: -1.2799, lambda_2: 1400.5848 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.59 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
11101111110011111111010110111101111010010100100100
10111111111111111011100110000000000100000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010101010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9030 @ step 219000 epoch 19.26
Saving the best model so far: [Epoch 21 | Step: 243000 | MACs sparsity: 0.6594 | Score: 0.9044 | Loss: 0.5303]
loss: 0.124508, lagrangian_loss: 0.000145, attention_score_distillation_loss: 0.000010
loss: 0.090828, lagrangian_loss: 0.001139, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:31:09
Evaluating: accuracy: 0.9021, eval_loss: 0.5192, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 246000
lambda_1: -1.1477, lambda_2: 1417.9023 lambda_3: 0.0000
train remain: [0.74 0.66 0.45 0.43 0.32 0.28 0.28 0.51 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110011111111011110111101111010010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.021891, lagrangian_loss: 0.001887, attention_score_distillation_loss: 0.000010
loss: 0.013079, lagrangian_loss: 0.031589, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:45:19
Evaluating: accuracy: 0.903, eval_loss: 0.515, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 249000
lambda_1: -0.8239, lambda_2: 1435.4276 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.42 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111010110111101111010010100100110
10111111111111111111100110000000000000000000000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010110000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.013151, lagrangian_loss: 0.001103, attention_score_distillation_loss: 0.000010
ETA: 16:28:20 | Epoch 21 finished. Took 3289.8 seconds.
loss: 0.067399, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 10:59:26
Evaluating: accuracy: 0.9034, eval_loss: 0.5368, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 252000
lambda_1: -0.9058, lambda_2: 1452.7882 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.37 0.26]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110111111111010110111101111010010100100100
10111111111111111111100110000000000000000000000000
10000111110110101011011010011100010001000010000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.219977, lagrangian_loss: 0.003928, attention_score_distillation_loss: 0.000010
loss: 0.248259, lagrangian_loss: -0.000301, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:13:34
Evaluating: accuracy: 0.9031, eval_loss: 0.5162, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 255000
lambda_1: -0.4067, lambda_2: 1470.0902 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.34 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111011010100100100
10111111111111111011100110001000000000000000000000
10001111110110101011011010011100010001000000000000
10000001110010101001010010001100010101010000000000
10000001110010001011000010000100010001010100000000
10000000110010001001000010010100010001010100000001
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
loss: 0.014459, lagrangian_loss: 0.002309, attention_score_distillation_loss: 0.000010
loss: 0.347549, lagrangian_loss: 0.010246, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:27:42
Evaluating: accuracy: 0.9051, eval_loss: 0.5354, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 258000
lambda_1: -1.1459, lambda_2: 1487.8198 lambda_3: 0.0000
train remain: [0.74 0.65 0.44 0.43 0.32 0.28 0.28 0.35 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111110010100100100
10111111111111111011100110000000000000000010000000
10000111110110101011011010011101010001000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9044 @ step 243000 epoch 21.37
Saving the best model so far: [Epoch 22 | Step: 258000 | MACs sparsity: 0.6594 | Score: 0.9051 | Loss: 0.5354]
loss: 0.076614, lagrangian_loss: 0.000602, attention_score_distillation_loss: 0.000010
loss: 0.013523, lagrangian_loss: 0.001096, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:42:15
Evaluating: accuracy: 0.9042, eval_loss: 0.5124, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 261000
lambda_1: -1.0278, lambda_2: 1504.6995 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.28 0.45 0.26]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000010000
10000111110110101011011010011101010101000000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.040849, lagrangian_loss: 0.001851, attention_score_distillation_loss: 0.000010
ETA: 15:33:17 | Epoch 22 finished. Took 3283.32 seconds.
loss: 0.097342, lagrangian_loss: 0.009542, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 11:56:25
Evaluating: accuracy: 0.9032, eval_loss: 0.547, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.649, expected_sequence_sparsity: 0.9186, target_sparsity: 0.65, step: 264000
lambda_1: -0.6112, lambda_2: 1521.8701 lambda_3: 0.0000
train remain: [0.74 0.66 0.44 0.43 0.32 0.28 0.29 0.51 0.25]
infer remain: [0.74, 0.66, 0.44, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110101000000000
10101111110011111111011110111101111010010100100100
10111111111111111011100110000000000000000000100000
10000111110110101011011010011101010101000000000000
10000011110010101001010010001100010001010000000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000110010001001000010000100010001010000000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.039525, lagrangian_loss: 0.003719, attention_score_distillation_loss: 0.000010
loss: 0.004788, lagrangian_loss: 0.004833, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:10:34
Evaluating: accuracy: 0.9022, eval_loss: 0.5393, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 267000
lambda_1: -0.9719, lambda_2: 1540.0824 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.42 0.25]
infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111011100110000000000000100000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010101010000000000
10000000110010001001000010001100010001010110000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010100000001
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.079875, lagrangian_loss: -0.000143, attention_score_distillation_loss: 0.000010
loss: 0.336385, lagrangian_loss: 0.002171, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:24:41
Evaluating: accuracy: 0.9033, eval_loss: 0.552, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 270000
lambda_1: -0.7755, lambda_2: 1557.3724 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.42 0.25]
infer remain: [0.74, 0.66, 0.42, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111011100110000000000000000000000000
10000111110110101011011010011100010101000001000000
10000001110010101001010010001100010001010010000000
10000001110010001001000010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.024105, lagrangian_loss: 0.001687, attention_score_distillation_loss: 0.000010
loss: 0.007869, lagrangian_loss: 0.000499, attention_score_distillation_loss: 0.000010
ETA: 14:35:41 | Epoch 23 finished. Took 3050.79 seconds.
----------------------------------------------------------------------
time: 2023-07-20 12:38:47
Evaluating: accuracy: 0.9029, eval_loss: 0.5254, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6503, expected_sequence_sparsity: 0.9189, target_sparsity: 0.65, step: 273000
lambda_1: -0.4702, lambda_2: 1574.4080 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.39 0.24]
infer remain: [0.74, 0.66, 0.42, 0.44, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100010000000
10111111110011111111010110111101111010010100100100
10111111111111111011100110000000000000000000000000
10000111110110101011011010011100010101000001000000
10000001110010101001010010001100010001010010000000
10000000110010001001010010000100010001010100000001
10000000110010001001000010000100010001010100000011
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.389616, lagrangian_loss: 0.009942, attention_score_distillation_loss: 0.000010
loss: 0.013081, lagrangian_loss: 0.005521, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 12:52:58
Evaluating: accuracy: 0.9037, eval_loss: 0.5323, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 276000
lambda_1: -0.9596, lambda_2: 1591.1188 lambda_3: 0.0000
train remain: [0.74 0.66 0.43 0.43 0.32 0.27 0.32 0.39 0.24]
infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.28, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10111111110011111111010110111101111010010100100100
10111111111111111101100110000000000000000000000000
10000111110110101011011010011100010101000000000000
10000001110010101001010010001100010001010010000000
10000000110010001001000010000100010001010100000011
10000000110010001001000010000100010001010100000011
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9051 @ step 258000 epoch 22.69
loss: 0.016249, lagrangian_loss: 0.000320, attention_score_distillation_loss: 0.000010
loss: 0.135135, lagrangian_loss: 0.004868, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 13:07:04
Evaluating: accuracy: 0.9043, eval_loss: 0.5183, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 279000
lambda_1: -0.6309, lambda_2: 1608.5593 lambda_3: 0.0000
train remain:
[0.74 0.66 0.43 0.43 0.32 0.27 0.31 0.36 0.24] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10111111110011111111010110111101111010010100100100 10111111111111111001100110000000000000000010000000 10000111110110101011011010011100010101000000000000 10000011110010101001010010001100010001010000000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.007925, lagrangian_loss: 0.001710, attention_score_distillation_loss: 0.000010 loss: 0.025356, lagrangian_loss: 0.005295, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 13:21:13 Evaluating: accuracy: 0.904, eval_loss: 0.5355, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 282000 lambda_1: -1.5243, lambda_2: 1626.1809 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.33 0.39 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111110010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001011000000000 10000000110110001001000010000101010001010100000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.299053, lagrangian_loss: 0.000737, attention_score_distillation_loss: 0.000010 loss: 0.703120, lagrangian_loss: 0.000250, attention_score_distillation_loss: 0.000010 ETA: 13:40:42 | Epoch 24 finished. Took 3259.07 seconds. 
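Note on the lagrangian_loss values printed throughout: they are consistent with the usual two-multiplier Lagrangian relaxation used for L0 pruning, where the trained multipliers lambda_1 and lambda_2 scale the linear and quadratic gap between expected and target sparsity. A minimal sketch under that assumption (names are illustrative, not this repository's actual API):

    # Sketch of the penalty implied by the printed fields; assumes the
    # standard Lagrangian L0 formulation, not code from this repository.
    def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap * gap

    # With the step-249000 numbers (gap = 0.6495 - 0.65 = -0.0005,
    # lambda_1 = -0.8239, lambda_2 = 1435.4276) this gives ~0.0008,
    # the same order of magnitude as the per-batch values logged there.

This would also explain why lambda_2 keeps growing (from ~1435 to ~2300 over these epochs) while the penalty itself hovers near zero: the multipliers are trained adversarially to hold expected sparsity at the 0.65 target.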
---------------------------------------------------------------------- time: 2023-07-20 13:35:22 Evaluating: accuracy: 0.9049, eval_loss: 0.5242, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6495, expected_sequence_sparsity: 0.9187, target_sparsity: 0.65, step: 285000 lambda_1: -0.2742, lambda_2: 1643.1600 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.45 0.25] infer remain: [0.74, 0.66, 0.44, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000100000 10101111110011111111010110111101111011010100100100 10111111111111111001100110000000000001010000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001011000000000 10000000110010001001000010000100010001010111000000 10000000110010001001000010000100010001010110000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.015952, lagrangian_loss: 0.000811, attention_score_distillation_loss: 0.000010 loss: 0.018504, lagrangian_loss: 0.011957, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 13:49:31 Evaluating: accuracy: 0.9041, eval_loss: 0.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 288000 lambda_1: -0.9049, lambda_2: 1660.4192 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.3 0.44 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100001000000 10101111110111111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010111100010001000000000000 10000001110010101001010010001100010101010000000000 10000000110010001001000010000100010101010100000001 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.009474, lagrangian_loss: 0.003730, attention_score_distillation_loss: 0.000010 loss: 0.015891, lagrangian_loss: 0.001593, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:03:43 Evaluating: accuracy: 0.9046, eval_loss: 0.5021, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 291000 lambda_1: -0.6821, lambda_2: 1677.4827 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.31 0.28 0.3 0.5 0.24] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110101000000000 10101111110011111111010110111101111010010100101100 10111111111111111001100110000000000000000100000000 10000111110110101011011010011100010001000010000000 10000001110010101001010010001100010001010100000000 10000000110010001001010010000100010001010100000001 
10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.174100, lagrangian_loss: 0.028944, attention_score_distillation_loss: 0.000010 loss: 0.008187, lagrangian_loss: 0.002347, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:17:51 Evaluating: accuracy: 0.9044, eval_loss: 0.5215, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 294000 lambda_1: -0.5183, lambda_2: 1694.6652 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.32 0.28 0.31 0.46 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111001110110000000000000000000000000 10000111110110101011011010011100010101000000000000 10000001110010101001010010001100010001010010000000 10000001110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 loss: 0.016532, lagrangian_loss: 0.003743, attention_score_distillation_loss: 0.000010 loss: 0.023074, lagrangian_loss: 0.001229, attention_score_distillation_loss: 0.000010 ETA: 12:45:50 | Epoch 25 finished. Took 3264.97 seconds. ---------------------------------------------------------------------- time: 2023-07-20 14:32:01 Evaluating: accuracy: 0.9055, eval_loss: 0.4988, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 297000 lambda_1: -0.8036, lambda_2: 1711.3834 lambda_3: 0.0000 train remain: [0.74 0.66 0.43 0.43 0.31 0.27 0.36 0.56 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.28, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110101000000000 10101111110111111111010110111101111010010100100100 10111111111111111001110110000000000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000011 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9051 @ step 258000 epoch 22.69 Saving the best model so far: [Epoch 26 | Step: 297000 | MACs sparsity: 0.6594 | Score: 0.9055 | Loss: 0.4988] loss: 0.017057, lagrangian_loss: 0.002419, attention_score_distillation_loss: 0.000010 loss: 0.013105, lagrangian_loss: 0.003510, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 14:46:45 Evaluating: accuracy: 0.9042, eval_loss: 0.5095, token_prune_loc: [True, True, True, True, True, True, True, False, True], macs_sparsity: 0.6594, expected_sparsity: 0.6507, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 300000 lambda_1: -0.3759, lambda_2: 1729.3628 lambda_3: 0.0000 
train remain: [0.74 0.66 0.43 0.43 0.32 0.27 0.4 0.64 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 1.0, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10101111110111111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 11111111111111111111111111111111111111111111111111 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9055 @ step 297000 epoch 26.12 loss: 0.407412, lagrangian_loss: 0.002169, attention_score_distillation_loss: 0.000010 loss: 0.058278, lagrangian_loss: 0.004591, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:00:59 Evaluating: accuracy: 0.9059, eval_loss: 0.5163, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 303000 lambda_1: -1.1274, lambda_2: 1746.3228 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.43 0.31 0.26 0.39 0.49 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111001100110000100000000000000000000 10000111110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9055 @ step 297000 epoch 26.12 Saving the best model so far: [Epoch 26 | Step: 303000 | MACs sparsity: 0.6594 | Score: 0.9059 | Loss: 0.5163] loss: 0.231925, lagrangian_loss: 0.000309, attention_score_distillation_loss: 0.000010 loss: 0.284112, lagrangian_loss: 0.019746, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:15:54 Evaluating: accuracy: 0.9043, eval_loss: 0.5117, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 306000 lambda_1: -0.6597, lambda_2: 1763.6797 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.43 0.32 0.27 0.32 0.34 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100100000000 10101111111011111111010110111101111010010100100100 10111111111111111000100110000000001010000000000000 10000111110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9059 @ step 303000 epoch 26.65 loss: 0.010959, lagrangian_loss: 0.000706, attention_score_distillation_loss: 0.000010 ETA: 11:51:39 | Epoch 26 
finished. Took 3348.45 seconds. loss: 0.005960, lagrangian_loss: 0.002558, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:30:09 Evaluating: accuracy: 0.906, eval_loss: 0.5142, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 309000 lambda_1: -1.8645, lambda_2: 1780.6941 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.31 0.26 0.32 0.4 0.24] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010110111101111110010100100100 10111111111111111000100110000000001000000000000000 10000111110110101011011010011101010001000000000000 10000011110010101001010010001100010000010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9059 @ step 303000 epoch 26.65 Saving the best model so far: [Epoch 27 | Step: 309000 | MACs sparsity: 0.6622 | Score: 0.906 | Loss: 0.5142] loss: 0.013322, lagrangian_loss: 0.012124, attention_score_distillation_loss: 0.000010 loss: 0.029035, lagrangian_loss: -0.000044, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:44:50 Evaluating: accuracy: 0.9045, eval_loss: 0.5212, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 312000 lambda_1: -1.0970, lambda_2: 1797.9958 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.31 0.26 0.36 0.35 0.25] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111111100000000000 10101111110011111111010110111101111010011100100100 10111111111111111000100110001000000000000000000000 10000111110110101011011010011100010001000001000000 10000001110010101001010010001101010001010000000000 10000010110010001001000010000101010001010000000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.012972, lagrangian_loss: 0.000404, attention_score_distillation_loss: 0.000010 loss: 0.004734, lagrangian_loss: 0.009444, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 15:58:59 Evaluating: accuracy: 0.9055, eval_loss: 0.5199, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 315000 lambda_1: -0.7523, lambda_2: 1815.5365 lambda_3: 0.0000 train remain: [0.76 0.66 0.41 0.42 0.31 0.26 0.37 0.36 0.25] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000001000 10101111110011111111010110111101111110010100100100 
10111111111111111000101110000000000000000000000000 10000111110110101011011010011100010001010000000000 10000001110010101011010010001100010001010000000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.005888, lagrangian_loss: 0.005344, attention_score_distillation_loss: 0.000010 loss: 0.011883, lagrangian_loss: 0.010148, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:13:09 Evaluating: accuracy: 0.9048, eval_loss: 0.5194, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 318000 lambda_1: -0.1479, lambda_2: 1832.5874 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.32 0.26 0.39 0.38 0.25] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111111111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111000100110000100001000000000000000 10000111110110101011011010011100010001010000000000 10000001110010101001010010001100010100010100000000 10000001110010001001000010000100010001010100000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.012396, lagrangian_loss: 0.000386, attention_score_distillation_loss: 0.000010 ETA: 10:57:02 | Epoch 27 finished. Took 3300.38 seconds. 
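The layerwise remain vector appears to be the running product of infer remain across the nine pruned layers, with 1.0 entries for the three leading unpruned layers. A quick check against the step-318000 block above (my reconstruction, not the logging code):

    from itertools import accumulate
    from operator import mul

    # infer remain as printed at step 318000
    infer_remain = [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24]
    # three unpruned leading layers, then the cumulative product
    layerwise = [1.0, 1.0, 1.0] + [round(x, 2) for x in accumulate(infer_remain, mul)]
    print(layerwise)
    # [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0]

which reproduces the logged vector exactly, i.e. the expected fraction of the sequence surviving to each layer.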
loss: 0.111784, lagrangian_loss: 0.001690, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:27:24 Evaluating: accuracy: 0.9053, eval_loss: 0.5271, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 321000 lambda_1: -0.3673, lambda_2: 1849.7450 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.42 0.32 0.26 0.37 0.36 0.26] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.28, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000010000 10101111110011111111010110111101111110010100100100 10111111111111111000100110001000000000000000000000 10000111110110101011011010011100010001000100000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.359683, lagrangian_loss: 0.014730, attention_score_distillation_loss: 0.000010 loss: 0.254084, lagrangian_loss: 0.002681, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:41:37 Evaluating: accuracy: 0.9052, eval_loss: 0.5209, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 324000 lambda_1: -0.5201, lambda_2: 1866.8728 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.42 0.32 0.26 0.34 0.34 0.26] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111000100110001000000000100000000000 10000111110110101011011010011100010001000000100000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.007427, lagrangian_loss: 0.000878, attention_score_distillation_loss: 0.000010 loss: 0.014517, lagrangian_loss: 0.001718, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 16:55:53 Evaluating: accuracy: 0.9052, eval_loss: 0.5356, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 327000 lambda_1: -0.4991, lambda_2: 1884.3604 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.36 0.38 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111010010110100100 10111111111111111000100110000000010000000000000000 10001111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 
10000000110010001001000010001100010001010100000000 10000000110010001011000010000100010001010100000000 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 loss: 0.203065, lagrangian_loss: 0.011325, attention_score_distillation_loss: 0.000010 loss: 0.146180, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.000010 ETA: 10:00:56 | Epoch 28 finished. Took 3072.73 seconds. ---------------------------------------------------------------------- time: 2023-07-20 17:10:06 Evaluating: accuracy: 0.9068, eval_loss: 0.5168, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 330000 lambda_1: -0.3473, lambda_2: 1902.1956 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.34 0.41 0.33] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 11101111110011111111010110111101111010010100100100 10111111111111111110100110000000000000000000000000 10000111110110101011011010011100010001000100000000 10000001110010101001010010001100010100010100000000 10000000110010001001010010000100010001010100000000 10000000110010001011000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9060 @ step 309000 epoch 27.17 Saving the best model so far: [Epoch 29 | Step: 330000 | MACs sparsity: 0.6594 | Score: 0.9068 | Loss: 0.5168] loss: 0.008257, lagrangian_loss: 0.011549, attention_score_distillation_loss: 0.000010 loss: 0.016565, lagrangian_loss: 0.003160, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:24:46 Evaluating: accuracy: 0.9069, eval_loss: 0.5136, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 333000 lambda_1: -0.7132, lambda_2: 1917.9243 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.29 0.33 0.29] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010111111101111010010100100100 10111111111111111000110110000000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010000010101000000 10000000110010001001000010000100010001010000100001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9068 @ step 330000 epoch 29.02 Saving the best model so far: [Epoch 29 | Step: 333000 | MACs sparsity: 0.6622 | Score: 0.9069 | Loss: 0.5136] loss: 0.006383, lagrangian_loss: 0.003007, attention_score_distillation_loss: 0.000010 loss: 0.257783, lagrangian_loss: 0.001209, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:39:22 Evaluating: accuracy: 0.9062, eval_loss: 0.516, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, 
expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 336000 lambda_1: -1.0286, lambda_2: 1934.3567 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.31 0.26 0.31 0.28 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000010000 11101111110011111111010110111101111010010100100100 10111111111111111000100110100000000000000000000000 10000011110110101011011010011101010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010000000011 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.011127, lagrangian_loss: 0.000437, attention_score_distillation_loss: 0.000010 loss: 0.027935, lagrangian_loss: 0.001384, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 17:53:37 Evaluating: accuracy: 0.9061, eval_loss: 0.5259, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6513, expected_sequence_sparsity: 0.9191, target_sparsity: 0.65, step: 339000 lambda_1: -0.3205, lambda_2: 1951.8823 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.36 0.27 0.31] infer remain: [0.74, 0.66, 0.42, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110110000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110100000000000000100000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000001110010001011000010000100010001010000000000 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.010422, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000010 loss: 0.048148, lagrangian_loss: 0.000488, attention_score_distillation_loss: 0.000010 ETA: 9:06:35 | Epoch 29 finished. Took 3328.34 seconds. 
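Each 50-character row in these dumps looks like one pruned layer's binary keep/drop mask over 50 token-position bins, so the fraction of 1s in a row should reproduce that layer's infer remain entry. Checking the last row of the step-336000 block:

    # row copied verbatim from the step-336000 dump (deepest pruned layer)
    row = "10000000010010001001000010000100010001010000000011"
    print(sum(ch == "1" for ch in row) / len(row))
    # 0.24, matching the final infer remain entry for that step

The same check holds for the other rows (e.g. 37/50 = 0.74 for the first layer), which is why every remain value is a multiple of 0.02.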
---------------------------------------------------------------------- time: 2023-07-20 18:07:48 Evaluating: accuracy: 0.9068, eval_loss: 0.5195, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 342000 lambda_1: -0.3236, lambda_2: 1968.7819 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.4 0.31 0.39] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111100100110000000000100000000000000 10000011110110101011011010011101010001000000100000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.005336, lagrangian_loss: 0.001894, attention_score_distillation_loss: 0.000010 loss: 0.006858, lagrangian_loss: 0.001738, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:22:02 Evaluating: accuracy: 0.9067, eval_loss: 0.5062, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6594, expected_sparsity: 0.6508, expected_sequence_sparsity: 0.919, target_sparsity: 0.65, step: 345000 lambda_1: -0.2562, lambda_2: 1985.3673 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.34 0.31 0.37] infer remain: [0.74, 0.66, 0.42, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.21, 0.09, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111011110111101111010010100100100 10111111111111111000100110000100000100000000000000 10000011110110101011011010011100011011000000000000 10000001110010101001010010001100010100010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.043365, lagrangian_loss: 0.002136, attention_score_distillation_loss: 0.000010 loss: 0.011090, lagrangian_loss: 0.000438, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:36:21 Evaluating: accuracy: 0.9062, eval_loss: 0.513, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 348000 lambda_1: -0.3346, lambda_2: 2002.7759 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.32 0.33 0.39] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111011110111101111010010100100100 10111111111111111000100110000000000000000000000100 10000111110110101011011010011100010001000000000000 10000011110010101001010010001100010000010100000000 10000000110010001001010010000100010001010100000000 
10000000110010001001000010000100010001010100000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.007361, lagrangian_loss: 0.004158, attention_score_distillation_loss: 0.000010 loss: 0.007597, lagrangian_loss: 0.000635, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 18:50:34 Evaluating: accuracy: 0.9061, eval_loss: 0.5121, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 351000 lambda_1: -0.5228, lambda_2: 2020.1655 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.32 0.26 0.31 0.33 0.38] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000000000000000010000000 10000111110110101011011010011100010101000000000000 10000001110010101001010010001100010100010100000000 10000000110010001001000010000100010101010100000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.011360, lagrangian_loss: 0.006206, attention_score_distillation_loss: 0.000010 loss: 0.023159, lagrangian_loss: 0.039438, attention_score_distillation_loss: 0.000010 ETA: 8:11:57 | Epoch 30 finished. Took 3284.77 seconds. ---------------------------------------------------------------------- time: 2023-07-20 19:04:47 Evaluating: accuracy: 0.9064, eval_loss: 0.5295, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 354000 lambda_1: -0.7080, lambda_2: 2037.1526 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.28 0.32 0.34] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000000000000000010000000 10000011110110101011011010011100010101000100000000 10000001110010101001010010001100010001010100000000 10000001110010001001000010000100010001010100000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.010760, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000010 loss: 0.017829, lagrangian_loss: 0.003529, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:19:00 Evaluating: accuracy: 0.9048, eval_loss: 0.5163, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 357000 lambda_1: -0.5968, lambda_2: 2054.3672 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.28 0.35 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 
0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111101111010011100100100 10111111111111111000100110000000100000000000000000 10000011110110101011011010011100010101000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.006646, lagrangian_loss: 0.005841, attention_score_distillation_loss: 0.000010 loss: 0.018823, lagrangian_loss: 0.003421, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:33:21 Evaluating: accuracy: 0.9069, eval_loss: 0.5159, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 360000 lambda_1: -0.6352, lambda_2: 2071.4238 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.3 0.39 0.31] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110011111111010110111111111010010100100100 10111111111111111100100110000000000000000000000000 10000011110110101011011010011100010101000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 loss: 0.368457, lagrangian_loss: 0.000409, attention_score_distillation_loss: 0.000010 loss: 0.008010, lagrangian_loss: 0.000022, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 19:47:38 Evaluating: accuracy: 0.9071, eval_loss: 0.5138, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6521, expected_sequence_sparsity: 0.9193, target_sparsity: 0.65, step: 363000 lambda_1: -0.1604, lambda_2: 2088.8484 lambda_3: 0.0000 train remain: [0.74 0.67 0.41 0.41 0.33 0.26 0.3 0.34 0.28] infer remain: [0.74, 0.66, 0.4, 0.42, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111110111111111010110111101111010010100100100 10111111111111111000100110000100000000000000000000 10000011110110101011011010011100010101000010000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010101000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9069 @ step 333000 epoch 29.29 Saving the best model so far: [Epoch 31 | Step: 363000 | MACs sparsity: 0.6622 | Score: 0.9071 | Loss: 0.5138] loss: 0.011156, lagrangian_loss: 0.000268, attention_score_distillation_loss: 0.000010 ETA: 7:17:25 | Epoch 31 finished. Took 3309.39 seconds. 
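The ordering of the "Best eval score so far" and "Saving the best model so far" lines suggests the previous best is printed before the comparison, and a checkpoint line is emitted only when the new accuracy beats it (e.g. at step 363000 the log prints best 0.9069, then saves 0.9071). A plausible reconstruction of that bookkeeping (hypothetical helper, not the repo's code):

    best = {"score": 0.0, "step": 0, "epoch": 0.0}

    def on_eval(score, step, epoch, macs_sparsity, loss):
        # The previous best is printed first, exactly as in the log.
        print(f"Best eval score so far: {best['score']:.4f} "
              f"@ step {best['step']} epoch {best['epoch']:.2f}")
        if score > best["score"]:
            best.update(score=score, step=step, epoch=epoch)
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")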
loss: 0.007335, lagrangian_loss: 0.005575, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:02:16 Evaluating: accuracy: 0.9071, eval_loss: 0.5005, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 366000 lambda_1: -0.3659, lambda_2: 2106.5913 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.36 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.26, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10111111110011111111010110111101111010010100100100 10111111111111111000100110000000001000000000000000 10010011110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9071 @ step 363000 epoch 31.92 loss: 0.009722, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000010 loss: 0.310500, lagrangian_loss: 0.002695, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:16:34 Evaluating: accuracy: 0.9072, eval_loss: 0.5043, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 369000 lambda_1: -0.4041, lambda_2: 2122.8223 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.37 0.28] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 11111111111111111111111110111111111110100000000000 10101111111011111111010110111101111010010100100100 10111111111111111000100110010000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010100000001 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9071 @ step 363000 epoch 31.92 Saving the best model so far: [Epoch 32 | Step: 369000 | MACs sparsity: 0.6622 | Score: 0.9072 | Loss: 0.5043] loss: 0.201731, lagrangian_loss: 0.000303, attention_score_distillation_loss: 0.000010 loss: 0.012930, lagrangian_loss: 0.001717, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:31:05 Evaluating: accuracy: 0.9068, eval_loss: 0.5058, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 372000 lambda_1: -0.4263, lambda_2: 2139.8025 lambda_3: 0.0000 train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.29 0.34 0.3 ] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100010000000 10101111110011111111010110111101111110010100100100 
10111111111111111000110110000000000000000000000000 10000111110110101011011010011100010001000000000000 10000001110010101001010010001100010001010100000000 10000010110010001001000010000100010001010100000000 10000000110010001001000010000100010001010100000000 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9072 @ step 369000 epoch 32.45 loss: 0.006624, lagrangian_loss: 0.010225, attention_score_distillation_loss: 0.000010 loss: 0.011726, lagrangian_loss: 0.000200, attention_score_distillation_loss: 0.000010 ---------------------------------------------------------------------- time: 2023-07-20 20:45:16 Evaluating: accuracy: 0.9066, eval_loss: 0.5218, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 375000 lambda_1: -0.6381, lambda_2: 2157.4548 lambda_3: 0.0000 train remain: [0.75 0.66 0.42 0.41 0.33 0.26 0.31 0.32 0.32] infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24] layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0] 10111111111111111111111110111111111110100000100000 10101111110011111111010110111101111010010101100100 10111111111111111000110110000000000000000000000000 10000011110110101011011010011100010001000100000000 10000001110010101001010010001100010001010100000000 10000000110010001001000010000100010001010110000000 10000000110010001001000010000100010001010000000001 10000000010010001001000010000100010001010000000011 10000000010010001001000010000100010001010000000011 Best eval score so far: 0.9072 @ step 369000 epoch 32.45 loss: 0.010423, lagrangian_loss: 0.002671, attention_score_distillation_loss: 0.000010 ETA: 6:22:48 | Epoch 32 finished. Took 3299.15 seconds. 
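The ETA printed at each epoch boundary is roughly the remaining epochs times the latest epoch duration, assuming this run's 40-epoch schedule; the trainer may in fact average per-step times, so treat this as a back-of-the-envelope check only:

    def eta(epochs_done, total_epochs, last_epoch_secs):
        # naive ETA: remaining epochs at the most recent epoch duration
        remaining = int((total_epochs - epochs_done) * last_epoch_secs)
        h, rem = divmod(remaining, 3600)
        m, s = divmod(rem, 60)
        return f"{h}:{m:02d}:{s:02d}"

    print(eta(33, 40, 3299.15))
    # "6:24:54", close to the "6:22:48" logged after epoch 32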
loss: 0.005749, lagrangian_loss: 0.009991, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 20:59:29
Evaluating: accuracy: 0.9086, eval_loss: 0.4983, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 378000
lambda_1: -0.5325, lambda_2: 2174.3752
lambda_3: 0.0000
train remain: [0.75 0.66 0.41 0.41 0.33 0.26 0.3 0.32 0.28]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111001100110000000000000000000000000
10000011110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9072 @ step 369000 epoch 32.45
Saving the best model so far: [Epoch 33 | Step: 378000 | MACs sparsity: 0.6622 | Score: 0.9086 | Loss: 0.4983]
loss: 0.005285, lagrangian_loss: 0.002330, attention_score_distillation_loss: 0.000010
loss: 0.022412, lagrangian_loss: 0.000735, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:14:11
Evaluating: accuracy: 0.9071, eval_loss: 0.5111, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6525, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 381000
lambda_1: -0.7453, lambda_2: 2191.4399
lambda_3: 0.0000
train remain: [0.75 0.66 0.41 0.41 0.32 0.26 0.29 0.35 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.32, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.03, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000100
10101111110011111111110110111101111010010100100100
10111111111111111000100110000001000000000000000000
10000011110110101011011010011101010001000000000000
10000001110010101001010010001100010001010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.327011, lagrangian_loss: 0.005363, attention_score_distillation_loss: 0.000010
loss: 0.007561, lagrangian_loss: 0.004921, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:28:21
Evaluating: accuracy: 0.907, eval_loss: 0.5143, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 384000
lambda_1: -0.2550, lambda_2: 2208.1333
lambda_3: 0.0000
train remain: [0.74 0.67 0.4 0.41 0.31 0.26 0.29 0.36 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100100100
10111111111111111000100110100000000000000000000000
10000011110110101011011010011100010001010000000000
10000001110010101001010010001100010000010100000000
10000000110010001001000010000100010101010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.006060, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000010
loss: 0.008744, lagrangian_loss: 0.004687, attention_score_distillation_loss: 0.000010
ETA: 5:27:34 | Epoch 33 finished. Took 3097.55 seconds.
----------------------------------------------------------------------
time: 2023-07-20 21:42:37
Evaluating: accuracy: 0.9064, eval_loss: 0.5209, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 387000
lambda_1: -0.2488, lambda_2: 2225.4016
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.41 0.32 0.26 0.3 0.43 0.27]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10001011110110101011011010011100010001000000000000
10000001110010101001010010001100010000010100000000
10000000110010001001010010000100010001010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.011189, lagrangian_loss: 0.000375, attention_score_distillation_loss: 0.000010
loss: 0.008280, lagrangian_loss: 0.001082, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 21:56:52
Evaluating: accuracy: 0.9076, eval_loss: 0.4986, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 390000
lambda_1: -0.3095, lambda_2: 2242.7463
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.4 0.31 0.26 0.3 0.41 0.28]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10000011110110101011011010011100010001000100000000
10000001110010101001010010001100010001010000000000
10000000110010001001000010000100010001010000000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.011095, lagrangian_loss: 0.000680, attention_score_distillation_loss: 0.000010
loss: 0.008105, lagrangian_loss: 0.001251, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:11:06
Evaluating: accuracy: 0.907, eval_loss: 0.5086, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 393000
lambda_1: -0.1682, lambda_2: 2260.0564
lambda_3: 0.0000
train remain: [0.75 0.67 0.4 0.4 0.31 0.26 0.29 0.41 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000100110000000000000010000000000
10000011110110101011011010011100010001000001000000
10000001110010101001010010001100010001010000000000
10000000110010101001000010000100010001010100000000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.004605, lagrangian_loss: 0.015639, attention_score_distillation_loss: 0.000010
loss: 0.006394, lagrangian_loss: 0.011431, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:25:15
Evaluating: accuracy: 0.908, eval_loss: 0.5, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 396000
lambda_1: -0.6651, lambda_2: 2277.3845
lambda_3: 0.0000
train remain: [0.75 0.66 0.4 0.4 0.31 0.26 0.3 0.42 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111111111010010100100100
10111111111111111000100110100000000000000000000000
10000011110110101011011010011100010001000000100000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010100000001
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.020838, lagrangian_loss: 0.004873, attention_score_distillation_loss: 0.000010
loss: 0.013079, lagrangian_loss: 0.000335, attention_score_distillation_loss: 0.000010
ETA: 4:32:58 | Epoch 34 finished. Took 3274.9 seconds.
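Note on reading these blocks: the "layerwise remain" vector is consistent with a running product of the per-layer "infer remain" fractions, with layers outside prune_location (layers 0-2 here, since prune_location=[3..11]) fixed at 1.0. A minimal sketch of that relationship, using the step-387000 values above; the variable names are illustrative and not taken from the training code:

import numpy as np

# "infer remain" at step 387000, one entry per pruned layer (prune_location).
prune_location = [3, 4, 5, 6, 7, 8, 9, 10, 11]
infer_remain = [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]

remain = np.ones(12)                   # 12 encoder layers; unpruned layers keep all tokens
remain[prune_location] = infer_remain  # pruned layers keep only a fraction of token bins
layerwise = np.round(np.cumprod(remain), 2)
print(layerwise)
# -> [1. 1. 1. 0.74 0.49 0.2 0.08 0.02 0.01 0. 0. 0.], matching the logged "layerwise remain"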
----------------------------------------------------------------------
time: 2023-07-20 22:39:28
Evaluating: accuracy: 0.9074, eval_loss: 0.4995, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 399000
lambda_1: -0.8593, lambda_2: 2294.6165
lambda_3: 0.0000
train remain: [0.76 0.67 0.39 0.39 0.31 0.26 0.31 0.5 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111110110111101111010010100100100
10111111111111111100100110000000000000000000000000
10000011110110101011011010011101010001000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001010010000100010001010000010000
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.008430, lagrangian_loss: 0.000045, attention_score_distillation_loss: 0.000010
loss: 0.004441, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 22:53:40
Evaluating: accuracy: 0.9077, eval_loss: 0.5016, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 402000
lambda_1: -0.3077, lambda_2: 2311.3203
lambda_3: 0.0000
train remain: [0.75 0.67 0.39 0.4 0.31 0.26 0.32 0.41 0.26]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100101100
10111111111111111000100110000000000000000000000100
10000011110110101011011010011100010001000000010000
10000000110010101001010010001100010001010100000000
10000000110010001001000010001100010001010100000000
10000000110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.012902, lagrangian_loss: 0.009601, attention_score_distillation_loss: 0.000010
loss: 0.012333, lagrangian_loss: 0.001954, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:07:52
Evaluating: accuracy: 0.9069, eval_loss: 0.5044, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 405000
lambda_1: -0.7087, lambda_2: 2328.8462
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.31 0.26 0.32 0.37 0.25]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111111111111111110100000000000
10101111110011111111010110111101111010110100100100
10111111111111111000100110000000000000000010000000
10000011110110101011011010011100010001000010000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010000000011
10000000110010001001000010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.015108, lagrangian_loss: 0.008105, attention_score_distillation_loss: 0.000010
loss: 0.006259, lagrangian_loss: 0.001557, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:22:03
Evaluating: accuracy: 0.9063, eval_loss: 0.4933, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 408000
lambda_1: -0.5158, lambda_2: 2345.6768
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.33 0.39 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110111111111010110111101111010010100100100
10111111111111111000110110000000000000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001011000010010100010001010000000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007610, lagrangian_loss: 0.003020, attention_score_distillation_loss: 0.000010
ETA: 3:38:22 | Epoch 35 finished. Took 3274.06 seconds.
loss: 0.011378, lagrangian_loss: 0.009742, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:36:12
Evaluating: accuracy: 0.905, eval_loss: 0.4854, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 411000
lambda_1: -0.1714, lambda_2: 2362.8928
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.31 0.35 0.25]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110111111111010110111101111010010100100100
00111111111111111000100110010000100000000000000000
10000011110110101011011010001100010001010000100000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010110000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.014448, lagrangian_loss: 0.002844, attention_score_distillation_loss: 0.000010
loss: 0.005291, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-20 23:50:25
Evaluating: accuracy: 0.906, eval_loss: 0.5015, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6543, expected_sequence_sparsity: 0.9198, target_sparsity: 0.65, step: 414000
lambda_1: -0.4862, lambda_2: 2379.9822
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.28 0.32 0.26]
infer remain: [0.74, 0.66, 0.38, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010101100100
00111111111111111000100110000000001000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.005761, lagrangian_loss: 0.000847, attention_score_distillation_loss: 0.000010
loss: 0.020640, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:04:36
Evaluating: accuracy: 0.9053, eval_loss: 0.4972, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 417000
lambda_1: -0.3041, lambda_2: 2397.6174
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.33 0.26 0.27 0.32 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111011110111101111010010100100100
00111111111111111000110110000100000000000000000000
10001011110110101011011010001100010001000000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010001100010001010100000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.149664, lagrangian_loss: 0.001193, attention_score_distillation_loss: 0.000010
loss: 0.005796, lagrangian_loss: 0.002503, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:18:50
Evaluating: accuracy: 0.905, eval_loss: 0.4998, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 420000
lambda_1: -0.2627, lambda_2: 2414.6377
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.28 0.32 0.25]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110110000000000
10101111110011111111010110111101111011010100100100
00111111111111111100100110000100000000000000000000
10000011110110101011011010001101010001000000000000
10000000110010101001010010001100010001010100000000
10000001110010001001000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.014705, lagrangian_loss: 0.000021, attention_score_distillation_loss: 0.000010
ETA: 2:43:46 | Epoch 36 finished. Took 3272.29 seconds.
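Note on lagrangian_loss: lambda_1 and lambda_2 above are trainable Lagrange multipliers that tie the model's expected sparsity to target_sparsity. A hedged sketch of the usual L0/CoFi-style penalty this log is consistent with; the exact objective in this script (including the role of lambda_3, which stays at 0.0000 throughout) may differ:

def lagrangian_loss(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    # Linear + quadratic constraint penalty; the multipliers are updated by
    # gradient ascent, so expected sparsity is pushed toward the target from
    # both sides rather than merely bounded.
    diff = expected_sparsity - target_sparsity
    return lambda_1 * diff + lambda_2 * diff * diff

# Illustrative call with the step-420000 multipliers logged above:
print(lagrangian_loss(0.6532, 0.65, -0.2627, 2414.6377))

The logged lagrangian_loss values come from per-batch expected sparsity during training, so this eval-time calculation is indicative only.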
loss: 0.009032, lagrangian_loss: 0.010806, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:33:03
Evaluating: accuracy: 0.9047, eval_loss: 0.503, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 423000
lambda_1: -0.2451, lambda_2: 2431.6763
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.31 0.26 0.27 0.29 0.26]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000000010
10101111110011111111110110111101111010010100100100
00111111111111111000100110000000000000000000100100
10000011110110101011011010001100010101000000000000
10000000110010101001010010001100010001010100000000
10000000110010001011000010000100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.107755, lagrangian_loss: 0.003145, attention_score_distillation_loss: 0.000010
loss: 0.015333, lagrangian_loss: 0.001036, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 00:47:16
Evaluating: accuracy: 0.9048, eval_loss: 0.4962, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6532, expected_sequence_sparsity: 0.9195, target_sparsity: 0.65, step: 426000
lambda_1: -0.3200, lambda_2: 2448.7913
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.28 0.32 0.28]
infer remain: [0.74, 0.66, 0.4, 0.38, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100001000000
10101111110111111111010110111101111010010100100100
00111111111111111000100110000000100010000000000000
10000011110110101011011010001101010001000000000000
10000000110010101001010010001100010001010100000000
10000001110010001001000010000100010001010100000000
10000000010010001001000010000100010001010100000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007883, lagrangian_loss: 0.010295, attention_score_distillation_loss: 0.000010
loss: 0.009252, lagrangian_loss: 0.000482, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:01:30
Evaluating: accuracy: 0.9046, eval_loss: 0.4996, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 429000
lambda_1: -0.2840, lambda_2: 2466.2822
lambda_3: 0.0000
train remain: [0.76 0.66 0.39 0.39 0.32 0.26 0.27 0.37 0.29]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111010110100100100
00111111111111111000100110000000100010000000000000
10000011110110101011011010001101010001010000000000
10000000110010101001010010001100010001010100000000
10000000110010001001000010000100010101010001000000
10000000010010001001010010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.009898, lagrangian_loss: 0.002330, attention_score_distillation_loss: 0.000010
loss: 0.001772, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:15:37
Evaluating: accuracy: 0.9044, eval_loss: 0.5015, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6622, expected_sparsity: 0.6527, expected_sequence_sparsity: 0.9194, target_sparsity: 0.65, step: 432000
lambda_1: -0.0957, lambda_2: 2483.9102
lambda_3: 0.0000
train remain: [0.76 0.65 0.39 0.39 0.32 0.26 0.26 0.49 0.32]
infer remain: [0.74, 0.66, 0.4, 0.4, 0.3, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.49, 0.2, 0.08, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111011110111101111010010100100100
00111111111111111101100110000000000000000000000000
10000011110110101011011010001100010001000101000000
10000000110010101001010010001100010001010100000000
10000000110010101001000010000100010001010100000000
10000000010010001001010010000100010001010000000001
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.003926, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000010
ETA: 1:49:11 | Epoch 37 finished. Took 3272.06 seconds.
loss: 0.007299, lagrangian_loss: 0.000688, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:29:49
Evaluating: accuracy: 0.9042, eval_loss: 0.5107, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 435000
lambda_1: -0.3987, lambda_2: 2501.2861
lambda_3: 0.0000
train remain: [0.76 0.65 0.39 0.39 0.3 0.26 0.26 0.45 0.32]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111111100000000000
10101111110011111111010110111101111010010100100100
00111111111111111010100110000000000000000000100000
10000111110110101011011010001100010001000000000000
10000000110010101001010010001100010001010000000000
10000000110010001001000010001101010001010000000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.121294, lagrangian_loss: 0.006647, attention_score_distillation_loss: 0.000010
loss: 0.006836, lagrangian_loss: 0.000367, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:44:03
Evaluating: accuracy: 0.9053, eval_loss: 0.5002, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 438000
lambda_1: -0.3716, lambda_2: 2518.7251
lambda_3: 0.0000
train remain: [0.77 0.65 0.39 0.39 0.29 0.26 0.25 0.42 0.31]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110011111111010110111101111010010100100100
00111111111111111000110110000100000000000000000000
10000011110110101011011010001100010001000010000000
10000000110010101001010010001100010001010000000000
10000000110010001001000010001100010001010100000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.005267, lagrangian_loss: 0.004462, attention_score_distillation_loss: 0.000010
loss: 0.004606, lagrangian_loss: 0.000595, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 01:58:12
Evaluating: accuracy: 0.9041, eval_loss: 0.5077, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 441000
lambda_1: -0.2506, lambda_2: 2535.8037
lambda_3: 0.0000
train remain: [0.77 0.65 0.4 0.39 0.29 0.26 0.25 0.45 0.33]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
10111111111111111111111110111111111110100000100000
10101111110011111111010110111101111010010100100100
00111111111111111000100110000101000000000000000000
10000011110110101011011010001100010001100000000000
10000000110010101001010010000100010001010000001000
10000001110010001001010010000100010001010000000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.013767, lagrangian_loss: 0.000631, attention_score_distillation_loss: 0.000010
loss: 0.013047, lagrangian_loss: 0.001207, attention_score_distillation_loss: 0.000010
ETA: 0:54:30 | Epoch 38 finished. Took 3064.35 seconds.
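Note on the 0/1 rows: each evaluation prints one 50-character row per entry of prune_location, which reads naturally as a keep/drop mask over that layer's bin_num=50 token bins; the layer's "infer remain" is then the fraction of 1s in its row. A quick check against the step-441000 block, with rows copied verbatim from the log:

# Keep masks for the first (layer 3) and last (layer 11) pruned layers, step 441000.
masks = {
    3: "10111111111111111111111110111111111110100000100000",
    11: "10000000010010001001000010000100010001010000000011",
}
for layer, mask in masks.items():
    print(layer, mask.count("1") / len(mask))
# -> 3 0.74 and 11 0.24, matching the logged "infer remain" entries for those layers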
----------------------------------------------------------------------
time: 2023-07-21 02:12:26
Evaluating: accuracy: 0.9066, eval_loss: 0.5079, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6564, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 444000
lambda_1: -0.7933, lambda_2: 2553.1509
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.28 0.26 0.24 0.39 0.27]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100001100
00111111111111111001100110000000000000000000000000
10000011110110101011011010001100010101000000000000
10000000110010101001010010000100010001010100000000
10000000110010001001010010000100010001010010000000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007001, lagrangian_loss: 0.000182, attention_score_distillation_loss: 0.000010
loss: 0.317712, lagrangian_loss: 0.014128, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:26:37
Evaluating: accuracy: 0.9062, eval_loss: 0.498, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6564, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 447000
lambda_1: -0.5744, lambda_2: 2570.6865
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.28 0.26 0.24 0.4 0.28]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111011110111101111010010100000100
00111111111111111010100110000000000000000000000000
10000011110110101011011010001101010001000000000000
10000000110010101011010010000100010001010000000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.003147, lagrangian_loss: 0.000686, attention_score_distillation_loss: 0.000010
loss: 0.247291, lagrangian_loss: 0.001221, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:40:49
Evaluating: accuracy: 0.9072, eval_loss: 0.5071, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6552, expected_sequence_sparsity: 0.92, target_sparsity: 0.65, step: 450000
lambda_1: -0.3225, lambda_2: 2587.6221
lambda_3: 0.0000
train remain: [0.77 0.64 0.39 0.39 0.27 0.26 0.24 0.46 0.27]
infer remain: [0.74, 0.64, 0.4, 0.38, 0.28, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.19, 0.07, 0.02, 0.01, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
11101111110011111111010110111101111010010100000100
00111111111111111010100110000000000000010000000000
10000011110110101011011010001100011001000000000000
10000000110010101001000010010100010001010010000000
10000000110010001001000010000100010001010100100000
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.020886, lagrangian_loss: 0.001091, attention_score_distillation_loss: 0.000010
loss: 0.011388, lagrangian_loss: 0.000267, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-21 02:54:59
Evaluating: accuracy: 0.9062, eval_loss: 0.5078, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6649, expected_sparsity: 0.6565, expected_sequence_sparsity: 0.9203, target_sparsity: 0.65, step: 453000
lambda_1: -0.6912, lambda_2: 2605.0227
lambda_3: 0.0000
train remain: [0.78 0.64 0.39 0.39 0.27 0.26 0.24 0.36 0.26]
infer remain: [0.74, 0.64, 0.38, 0.38, 0.26, 0.26, 0.24, 0.24, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.47, 0.18, 0.07, 0.02, 0.0, 0.0, 0.0, 0.0]
11111111111111111111111110111111111110100000000000
10101111110011111111010110111101111010010100000101
00111111111111111000100110000100000000000000000000
10000011110110101011011010011100010001000000000000
10000000110010101001000010000100010001010100000000
10000000110010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
10000000010010001001000010000100010001010000000011
Best eval score so far: 0.9086 @ step 378000 epoch 33.24
loss: 0.007881, lagrangian_loss: 0.006042, attention_score_distillation_loss: 0.000010
loss: 0.003655, lagrangian_loss: 0.000986, attention_score_distillation_loss: 0.000010
ETA: 0:00:00 | Epoch 39 finished. Took 3270.83 seconds.
07/21/2023 03:03:39 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=9a5a65ea-641a-4e71-bf7b-573708e6a20c&run_id=9a5a65ea-641a-4e71-bf7b-573708e6a20c
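Note on checkpointing: the "Best eval score so far" / "Saving the best model so far" lines indicate that a checkpoint is written only when eval accuracy improves on the running best (this run peaks at 0.9086 @ step 378000 and never improves afterwards, so nothing is saved again). A sketch of that bookkeeping; save_checkpoint is a hypothetical stand-in for whatever the script actually calls:

best = {"score": float("-inf"), "step": None, "epoch": None}

def on_eval(score, step, epoch, macs_sparsity, loss):
    # Report the best score seen so far, then save only on improvement.
    if best["step"] is not None:
        print(f"Best eval score so far: {best['score']} @ step {best['step']} epoch {best['epoch']}")
    if score > best["score"]:
        best.update(score=score, step=step, epoch=epoch)
        print(f"Saving the best model so far: [Epoch {int(epoch)} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
        # save_checkpoint(...)  # hypothetical; the actual save call is not shown in the log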