/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:34:25 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:34:26 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:34:26 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/cola to /home/aiscuser/.cache/huggingface/datasets/glue/cola/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-06,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/CoLA/reproduce1/s0.43_lr2e-06_reglr0.02_alpha1e-05_warmup50_bin20/runs/Jul19_14-34-26_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=25,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=100.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/CoLA/reproduce1/s0.43_lr2e-06_reglr0.02_alpha1e-05_warmup50_bin20,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/CoLA/reproduce1/s0.43_lr2e-06_reglr0.02_alpha1e-05_warmup50_bin20,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.43_lr2e-06_reglr0.02_alpha1e-05_warmup50_bin20', pruning_type='token+pruner', reg_learning_rate=0.02, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=50, target_sparsity=0.43, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/CoLA', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=1e-05, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=20, topk=10)
----------------------------------------------------------------------
time: 2023-07-19 14:35:15
Evaluating: matthews_correlation: 0.6026, eval_loss: 0.5047, step: 0
lambda_1: 0.0000, lambda_2: 0.0000, lambda_3: 0.0000
Starting l0 regularization! using , temperature: 0.67, init drop rate: 0.01
token_loga shape: [10, 20]
prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 10
loss: 0.052408, lagrangian_loss: -0.000087, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:35:28
Evaluating: matthews_correlation: 0.5677, eval_loss: 0.5793, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0015, step: 50
lambda_1: -0.2438, lambda_2: 0.8540, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.023135, lagrangian_loss: 0.000267, attention_score_distillation_loss: 0.000046
loss: 0.051142, lagrangian_loss: 0.001364, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 14:35:40
Evaluating: matthews_correlation: 0.5728, eval_loss: 0.5744, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0031, step: 100
lambda_1: -1.5490, lambda_2: 2.2220, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.123130, lagrangian_loss: -0.001333, attention_score_distillation_loss: 0.000043
loss: 0.072426, lagrangian_loss: -0.006348, attention_score_distillation_loss: 0.000045
----------------------------------------------------------------------
time: 2023-07-19 14:35:53
Evaluating: matthews_correlation: 0.5624, eval_loss: 0.5941, token_prune_loc: [False, False, False, False, False, False, False, True, False, False], macs_sparsity: 0.0278, expected_sparsity: 0.0138, expected_sequence_sparsity: 0.8224, target_sparsity: 0.0047, step: 150
lambda_1: 0.2773, lambda_2: 4.3952, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 0.99 0.99 1. 0.99]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.95]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111011111111 11111111111111111111 11111111111111111111
loss: 0.084963, lagrangian_loss: 0.004010, attention_score_distillation_loss: 0.000044
loss: 0.060280, lagrangian_loss: 0.008714, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:36:05
Evaluating: matthews_correlation: 0.5624, eval_loss: 0.5967, token_prune_loc: [False, False, False, False, False, False, False, True, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0064, step: 200
lambda_1: 1.8112, lambda_2: 5.7940, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 0.99 1. 1. ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.295672, lagrangian_loss: 0.009288, attention_score_distillation_loss: 0.000045
loss: 0.051892, lagrangian_loss: 0.005360, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 14:36:18
Evaluating: matthews_correlation: 0.5831, eval_loss: 0.5759, token_prune_loc: [False, False, False, False, False, False, False, True, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.008, step: 250
lambda_1: 2.4073, lambda_2: 6.0494, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 0.99 1. 1. ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.043431, lagrangian_loss: 0.001352, attention_score_distillation_loss: 0.000043
ETA: 1:48:16 | Epoch 0 finished. Took 65.62 seconds.
loss: 0.044306, lagrangian_loss: -0.003026, attention_score_distillation_loss: 0.000046
----------------------------------------------------------------------
time: 2023-07-19 14:36:30
Evaluating: matthews_correlation: 0.5702, eval_loss: 0.5875, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0096, step: 300
lambda_1: 2.2977, lambda_2: 6.0863, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 0.99 1. 1. ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.022632, lagrangian_loss: -0.006640, attention_score_distillation_loss: 0.000042
loss: 0.086803, lagrangian_loss: -0.008527, attention_score_distillation_loss: 0.000044
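The sign flips in lagrangian_loss above reflect that lambda_1 and lambda_2 are learned multipliers (updated with reg_learning_rate=0.02) rather than fixed weights. A minimal sketch of the CoFi-style Lagrangian penalty these values are consistent with; the function name and signature are illustrative, not this repository's actual API:

    import torch

    def lagrangian_penalty(expected_sparsity: torch.Tensor,
                           target_sparsity: float,
                           lambda_1: torch.Tensor,
                           lambda_2: torch.Tensor) -> torch.Tensor:
        # The multipliers are trained adversarially (gradient ascent on this
        # term), so the logged value can go negative while the expected
        # sparsity still converges toward the target.
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap.pow(2)

On the descent step the model parameters minimize this penalty together with the task and distillation losses; on the ascent step the multipliers grow whenever expected sparsity lags the warmup target.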
----------------------------------------------------------------------
time: 2023-07-19 14:36:43
Evaluating: matthews_correlation: 0.5572, eval_loss: 0.6046, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0112, step: 350
lambda_1: 1.5253, lambda_2: 6.4221, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.091945, lagrangian_loss: -0.008573, attention_score_distillation_loss: 0.000044
loss: 0.025007, lagrangian_loss: -0.006087, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:36:55
Evaluating: matthews_correlation: 0.5547, eval_loss: 0.6094, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0128, step: 400
lambda_1: 0.2517, lambda_2: 7.3617, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.224894, lagrangian_loss: -0.001526, attention_score_distillation_loss: 0.000043
loss: 0.010139, lagrangian_loss: 0.004853, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:37:08
Evaluating: matthews_correlation: 0.5676, eval_loss: 0.596, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0144, step: 450
lambda_1: -1.2630, lambda_2: 8.8380, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.030199, lagrangian_loss: 0.012795, attention_score_distillation_loss: 0.000043
loss: 0.024457, lagrangian_loss: 0.022268, attention_score_distillation_loss: 0.000043
----------------------------------------------------------------------
time: 2023-07-19 14:37:20
Evaluating: matthews_correlation: 0.5729, eval_loss: 0.5818, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.016, step: 500
lambda_1: -2.8741, lambda_2: 10.6401, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.062125, lagrangian_loss: 0.032385, attention_score_distillation_loss: 0.000044
loss: 0.027900, lagrangian_loss: 0.042968, attention_score_distillation_loss: 0.000044
ETA: 1:46:33 | Epoch 1 finished. Took 64.86 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:37:33
Evaluating: matthews_correlation: 0.5752, eval_loss: 0.601, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0176, step: 550
lambda_1: -4.4755, lambda_2: 12.5007, lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 0.99 1. 1. ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.017210, lagrangian_loss: 0.049118, attention_score_distillation_loss: 0.000043
loss: 0.013244, lagrangian_loss: 0.033946, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:37:45
Evaluating: matthews_correlation: 0.5701, eval_loss: 0.6131, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.0404, expected_sparsity: 0.0316, expected_sequence_sparsity: 0.8257, target_sparsity: 0.0192, step: 600
lambda_1: -5.3782, lambda_2: 13.5456, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.89]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.85]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.81]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 11101111110111111110
loss: 0.018097, lagrangian_loss: -0.093441, attention_score_distillation_loss: 0.000043
loss: 0.011333, lagrangian_loss: -0.203073, attention_score_distillation_loss: 0.000041
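The train remain row is the expected fraction of the 20 token bins kept at each of the 10 prune locations, read off the learned gate parameters (token_loga shape: [10, 20]). A sketch of how such expected keep ratios are typically computed for hard-concrete gates (Louizos et al.); the stretch limits below are conventional values, assumed rather than taken from this codebase:

    import torch

    TEMPERATURE = 2.0 / 3.0       # matches temperature=0.6666... in the arguments
    LIMIT_L, LIMIT_R = -0.1, 1.1  # assumed stretch interval for the gates

    def expected_remain(token_loga: torch.Tensor) -> torch.Tensor:
        # Per-layer expected keep ratio: mean P(gate > 0) over the 20 bins.
        p_open = torch.sigmoid(
            token_loga - TEMPERATURE * torch.log(torch.tensor(-LIMIT_L / LIMIT_R))
        )
        return p_open.mean(dim=1)  # shape [10], printed as 'train remain'

infer remain, by contrast, comes from thresholding the gates into the hard 0/1 bin masks shown under each evaluation.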
----------------------------------------------------------------------
time: 2023-07-19 14:37:58
Evaluating: matthews_correlation: 0.5781, eval_loss: 0.6422, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1061, expected_sparsity: 0.0978, expected_sequence_sparsity: 0.8377, target_sparsity: 0.0208, step: 650
lambda_1: -2.0244, lambda_2: 17.8981, lambda_3: 0.0000
train remain: [0.99 1. 0.99 1. 1. 0.99 1. 0.91 0.87 0.56]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.55]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.42]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111010 11111111111111011100 11101111110000100100
loss: 0.079392, lagrangian_loss: -0.039606, attention_score_distillation_loss: 0.000042
loss: 0.004774, lagrangian_loss: 0.047873, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:38:10
Evaluating: matthews_correlation: 0.5838, eval_loss: 0.6168, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.0657, expected_sparsity: 0.0558, expected_sequence_sparsity: 0.8301, target_sparsity: 0.0224, step: 700
lambda_1: 0.5606, lambda_2: 20.4068, lambda_3: 0.0000
train remain: [0.99 1. 0.99 1. 1. 1. 1. 0.95 0.97 0.78]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.75]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.68]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111010 11111111111111111111 11101111110111110100
loss: 0.019307, lagrangian_loss: 0.046858, attention_score_distillation_loss: 0.000043
loss: 0.027777, lagrangian_loss: 0.028862, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:38:23
Evaluating: matthews_correlation: 0.578, eval_loss: 0.6059, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.0404, expected_sparsity: 0.0316, expected_sequence_sparsity: 0.8257, target_sparsity: 0.024, step: 750
lambda_1: 1.4530, lambda_2: 20.7984, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.9 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.85]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.81]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 11101111110111111110
loss: 0.069574, lagrangian_loss: 0.011637, attention_score_distillation_loss: 0.000041
loss: 0.422247, lagrangian_loss: -0.001200, attention_score_distillation_loss: 0.000043
----------------------------------------------------------------------
time: 2023-07-19 14:38:35
Evaluating: matthews_correlation: 0.5911, eval_loss: 0.5785, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0256, step: 800
lambda_1: 1.5431, lambda_2: 20.8189, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.96]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.067921, lagrangian_loss: -0.008891, attention_score_distillation_loss: 0.000042
ETA: 1:48:14 | Epoch 2 finished. Took 70.38 seconds.
loss: 0.036717, lagrangian_loss: -0.011372, attention_score_distillation_loss: 0.000044
----------------------------------------------------------------------
time: 2023-07-19 14:38:48
Evaluating: matthews_correlation: 0.5547, eval_loss: 0.6065, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0272, step: 850
lambda_1: 1.2481, lambda_2: 20.8546, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.97]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.090573, lagrangian_loss: -0.011549, attention_score_distillation_loss: 0.000043
loss: 0.050021, lagrangian_loss: -0.009948, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:39:00
Evaluating: matthews_correlation: 0.5598, eval_loss: 0.5994, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0288, step: 900
lambda_1: 0.7736, lambda_2: 20.9379, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.98]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.012121, lagrangian_loss: -0.006773, attention_score_distillation_loss: 0.000044
loss: 0.055650, lagrangian_loss: -0.002530, attention_score_distillation_loss: 0.000039
----------------------------------------------------------------------
time: 2023-07-19 14:39:13
Evaluating: matthews_correlation: 0.565, eval_loss: 0.6024, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0304, step: 950
lambda_1: 0.1849, lambda_2: 21.0633, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.98]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.012137, lagrangian_loss: 0.002438, attention_score_distillation_loss: 0.000042
loss: 0.022736, lagrangian_loss: 0.008151, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:39:25
Evaluating: matthews_correlation: 0.5754, eval_loss: 0.5841, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.032, step: 1000
lambda_1: -0.4737, lambda_2: 21.2186, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.97]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.011086, lagrangian_loss: 0.013997, attention_score_distillation_loss: 0.000043
loss: 0.074907, lagrangian_loss: 0.018040, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 14:39:38
Evaluating: matthews_correlation: 0.5701, eval_loss: 0.5929, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.8199, target_sparsity: 0.0336, step: 1050
lambda_1: -1.1301, lambda_2: 21.3722, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.94]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111
loss: 0.022617, lagrangian_loss: 0.018896, attention_score_distillation_loss: 0.000040
ETA: 1:46:26 | Epoch 3 finished. Took 65.23 seconds.
loss: 0.012110, lagrangian_loss: 0.013532, attention_score_distillation_loss: 0.000041
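The layerwise remain row behaves as a running product of the per-layer infer remain ratios: once a layer drops a token bin, every deeper layer only sees the surviving fraction. The step-650 block above illustrates this (0.9 * 0.85 = 0.765, then 0.765 * 0.55 = 0.42). A sketch, assuming the two extra leading entries cover the unpruned layers before prune location 2:

    import numpy as np

    # infer remain at step 650, one entry per prune location (layers 2-11)
    infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.55]

    # Layers 0 and 1 are never pruned, hence the two leading 1.0 entries
    # that make the printed list 12 elements long.
    layerwise_remain = np.cumprod([1.0, 1.0] + infer_remain)
    print(layerwise_remain.round(2))  # ends [... 0.9  0.76 0.42], as logged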
----------------------------------------------------------------------
time: 2023-07-19 14:39:50
Evaluating: matthews_correlation: 0.5881, eval_loss: 0.6024, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0253, expected_sparsity: 0.0251, expected_sequence_sparsity: 0.8245, target_sparsity: 0.0352, step: 1100
lambda_1: -1.5584, lambda_2: 21.4431, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.86]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110111110110
loss: 0.087711, lagrangian_loss: 0.002332, attention_score_distillation_loss: 0.000042
loss: 0.032587, lagrangian_loss: -0.009885, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 14:40:03
Evaluating: matthews_correlation: 0.5883, eval_loss: 0.5968, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.0657, expected_sparsity: 0.0495, expected_sequence_sparsity: 0.8289, target_sparsity: 0.0368, step: 1150
lambda_1: -1.3885, lambda_2: 21.4746, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.98 0.75]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.66]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 11101111110101110100
loss: 0.033407, lagrangian_loss: -0.015933, attention_score_distillation_loss: 0.000040
loss: 0.013245, lagrangian_loss: -0.012230, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 14:40:15
Evaluating: matthews_correlation: 0.5855, eval_loss: 0.6155, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.0657, expected_sparsity: 0.0495, expected_sequence_sparsity: 0.8289, target_sparsity: 0.0384, step: 1200
lambda_1: -0.6241, lambda_2: 21.6775, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.98 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.66]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 11101111110101110100
loss: 0.014363, lagrangian_loss: -0.004215, attention_score_distillation_loss: 0.000037
loss: 0.045978, lagrangian_loss: 0.002183, attention_score_distillation_loss: 0.000042
----------------------------------------------------------------------
time: 2023-07-19 14:40:28
Evaluating: matthews_correlation: 0.5885, eval_loss: 0.5896, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0376, expected_sequence_sparsity: 0.8267, target_sparsity: 0.04, step: 1250
lambda_1: 0.1177, lambda_2: 21.8686, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.99 0.75]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110101110100
loss: 0.035560, lagrangian_loss: 0.003440, attention_score_distillation_loss: 0.000041
loss: 0.109357, lagrangian_loss: 0.003341, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:40:41
Evaluating: matthews_correlation: 0.565, eval_loss: 0.6217, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0313, expected_sequence_sparsity: 0.8256, target_sparsity: 0.0417, step: 1300
lambda_1: 0.4873, lambda_2: 21.9195, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.78]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110111110110
loss: 0.031928, lagrangian_loss: 0.001994, attention_score_distillation_loss: 0.000040
loss: 0.148130, lagrangian_loss: -0.000321, attention_score_distillation_loss: 0.000041
ETA: 1:44:55 | Epoch 4 finished. Took 65.26 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:40:53
Evaluating: matthews_correlation: 0.5701, eval_loss: 0.6048, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0253, expected_sparsity: 0.0251, expected_sequence_sparsity: 0.8245, target_sparsity: 0.0433, step: 1350
lambda_1: 0.5340, lambda_2: 21.9264, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 0.99 0.83]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110111110110
loss: 0.043911, lagrangian_loss: -0.002125, attention_score_distillation_loss: 0.000041
loss: 0.004861, lagrangian_loss: -0.001830, attention_score_distillation_loss: 0.000038
----------------------------------------------------------------------
time: 2023-07-19 14:41:05
Evaluating: matthews_correlation: 0.5805, eval_loss: 0.594, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0253, expected_sparsity: 0.0251, expected_sequence_sparsity: 0.8245, target_sparsity: 0.0449, step: 1400
lambda_1: 0.1936, lambda_2: 21.9665, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 0.99 0.84]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110111111110
loss: 0.010877, lagrangian_loss: 0.000239, attention_score_distillation_loss: 0.000038
loss: 0.134874, lagrangian_loss: 0.001721, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:41:18
Evaluating: matthews_correlation: 0.5803, eval_loss: 0.6058, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0253, expected_sparsity: 0.0251, expected_sequence_sparsity: 0.8245, target_sparsity: 0.0465, step: 1450
lambda_1: -0.2458, lambda_2: 22.0265, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.81]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110111110110
loss: 0.005813, lagrangian_loss: 0.002361, attention_score_distillation_loss: 0.000041
loss: 0.007074, lagrangian_loss: 0.001915, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 14:41:30
Evaluating: matthews_correlation: 0.5779, eval_loss: 0.5994, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0313, expected_sequence_sparsity: 0.8256, target_sparsity: 0.0481, step: 1500
lambda_1: -0.4864, lambda_2: 22.0473, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.76]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.75]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110111110110
loss: 0.004728, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000040
loss: 0.074427, lagrangian_loss: -0.001546, attention_score_distillation_loss: 0.000041
----------------------------------------------------------------------
time: 2023-07-19 14:41:43
Evaluating: matthews_correlation: 0.5856, eval_loss: 0.595, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0376, expected_sequence_sparsity: 0.8267, target_sparsity: 0.0497, step: 1550
lambda_1: -0.3805, lambda_2: 22.0549, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110100110110
loss: 0.011561, lagrangian_loss: -0.001464, attention_score_distillation_loss: 0.000041
loss: 0.018154, lagrangian_loss: -0.000629, attention_score_distillation_loss: 0.000037
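Each 20-character bit string is one layer's hard token-bin mask (bin_num=20), and the corresponding infer remain entry is simply the fraction of ones. In the step-1550 block above, the last layer's mask 11101111110100110110 contains 14 ones, giving 14/20 = 0.7. A one-liner making that explicit:

    def infer_remain(mask: str) -> float:
        # Fraction of token bins kept at inference for one layer.
        return mask.count("1") / len(mask)

    print(infer_remain("11101111110100110110"))  # 0.7, matching step 1550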
----------------------------------------------------------------------
time: 2023-07-19 14:41:55
Evaluating: matthews_correlation: 0.5963, eval_loss: 0.5875, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0376, expected_sequence_sparsity: 0.8267, target_sparsity: 0.0513, step: 1600
lambda_1: -0.0982, lambda_2: 22.0778, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110101110100
loss: 0.018467, lagrangian_loss: -0.000073, attention_score_distillation_loss: 0.000039
ETA: 1:44:51 | Epoch 5 finished. Took 70.25 seconds.
loss: 0.069282, lagrangian_loss: 0.000039, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 14:42:08
Evaluating: matthews_correlation: 0.5854, eval_loss: 0.6184, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0379, expected_sparsity: 0.0376, expected_sequence_sparsity: 0.8267, target_sparsity: 0.0529, step: 1650
lambda_1: 0.0230, lambda_2: 22.0834, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.72]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110001110110
loss: 0.012715, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000039
loss: 0.357839, lagrangian_loss: 0.000042, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 14:42:20
Evaluating: matthews_correlation: 0.5916, eval_loss: 0.5793, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0438, expected_sequence_sparsity: 0.8279, target_sparsity: 0.0545, step: 1700
lambda_1: 0.0341, lambda_2: 22.0839, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.65]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111110000110110
loss: 0.083917, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000040
loss: 1.572352, lagrangian_loss: 0.000162, attention_score_distillation_loss: 0.000040
----------------------------------------------------------------------
time: 2023-07-19 14:42:33
Evaluating: matthews_correlation: 0.5839, eval_loss: 0.5849, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0438, expected_sequence_sparsity: 0.8279, target_sparsity: 0.0561, step: 1750
lambda_1: -0.0737, lambda_2: 22.0874, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.65]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110001110110
loss: 0.022114, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000040
loss: 0.008865, lagrangian_loss: 0.000152, attention_score_distillation_loss: 0.000039
----------------------------------------------------------------------
time: 2023-07-19 14:42:45
Evaluating: matthews_correlation: 0.5854, eval_loss: 0.6137, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0501, expected_sequence_sparsity: 0.829, target_sparsity: 0.0577, step: 1800
lambda_1: -0.1378, lambda_2: 22.0886, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.66]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110001110100
loss: 0.091069, lagrangian_loss: -0.000027, attention_score_distillation_loss: 0.000038
loss: 0.028480, lagrangian_loss: -0.000146, attention_score_distillation_loss: 0.000038
----------------------------------------------------------------------
time: 2023-07-19 14:42:57
Evaluating: matthews_correlation: 0.589, eval_loss: 0.5918, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0501, expected_sequence_sparsity: 0.829, target_sparsity: 0.0593, step: 1850
lambda_1: -0.0790, lambda_2: 22.0903, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.63]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110001110100
loss: 0.019272, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000036
loss: 0.153409, lagrangian_loss: -0.000013, attention_score_distillation_loss: 0.000038
ETA: 1:43:14 | Epoch 6 finished. Took 64.67 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:43:10
Evaluating: matthews_correlation: 0.583, eval_loss: 0.6137, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0501, expected_sequence_sparsity: 0.829, target_sparsity: 0.0609, step: 1900
lambda_1: -0.0137, lambda_2: 22.0917, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.63]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110001110100
loss: 0.047633, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000039
loss: 0.029942, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000039
----------------------------------------------------------------------
time: 2023-07-19 14:43:22
Evaluating: matthews_correlation: 0.5625, eval_loss: 0.6049, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0505, expected_sparsity: 0.0501, expected_sequence_sparsity: 0.829, target_sparsity: 0.0625, step: 1950
lambda_1: -0.0107, lambda_2: 22.0919, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111110000110110
loss: 0.289428, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000036
loss: 0.010460, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 14:43:35
Evaluating: matthews_correlation: 0.5885, eval_loss: 0.5894, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0564, expected_sequence_sparsity: 0.8302, target_sparsity: 0.0641, step: 2000
lambda_1: -0.0286, lambda_2: 22.0923, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.61]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111010000110110
loss: 0.035992, lagrangian_loss: 0.000053, attention_score_distillation_loss: 0.000038
loss: 0.025590, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000038
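The per-epoch ETA lines are consistent with a simple estimate: mean epoch wall time so far multiplied by the epochs remaining out of num_train_epochs=100. For example, after epoch 6 the seven recorded epoch times average 66.61 s, and 93 * 66.61 s is about 6195 s, i.e. 1:43:15, matching the logged 1:43:14 up to rounding. A sketch of that arithmetic:

    from datetime import timedelta

    epoch_times = [65.62, 64.86, 70.38, 65.23, 65.26, 70.25, 64.67]  # epochs 0-6
    num_train_epochs = 100

    remaining = num_train_epochs - len(epoch_times)
    eta = timedelta(seconds=round(sum(epoch_times) / len(epoch_times) * remaining))
    print(eta)  # 1:43:15, vs. the logged 'ETA: 1:43:14' after epoch 6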
----------------------------------------------------------------------
time: 2023-07-19 14:43:47
Evaluating: matthews_correlation: 0.5755, eval_loss: 0.6086, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0564, expected_sequence_sparsity: 0.8302, target_sparsity: 0.0657, step: 2050
lambda_1: -0.0605, lambda_2: 22.0928, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.59]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111010000110110
loss: 0.010836, lagrangian_loss: -0.000027, attention_score_distillation_loss: 0.000040
loss: 0.200192, lagrangian_loss: 0.000083, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:44:00
Evaluating: matthews_correlation: 0.5916, eval_loss: 0.5759, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0564, expected_sequence_sparsity: 0.8302, target_sparsity: 0.0673, step: 2100
lambda_1: -0.0799, lambda_2: 22.0931, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.57]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111010000110110
loss: 0.020215, lagrangian_loss: 0.000121, attention_score_distillation_loss: 0.000036
loss: 0.034196, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000037
ETA: 1:41:42 | Epoch 7 finished. Took 64.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:44:12
Evaluating: matthews_correlation: 0.5889, eval_loss: 0.582, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0564, expected_sequence_sparsity: 0.8302, target_sparsity: 0.0689, step: 2150
lambda_1: -0.1308, lambda_2: 22.0939, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.56]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.55]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11101111010000110100
loss: 0.018474, lagrangian_loss: 0.000202, attention_score_distillation_loss: 0.000037
loss: 0.026257, lagrangian_loss: 0.000212, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:44:24
Evaluating: matthews_correlation: 0.5763, eval_loss: 0.5868, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0626, expected_sequence_sparsity: 0.8313, target_sparsity: 0.0705, step: 2200
lambda_1: -0.1810, lambda_2: 22.0950, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.54]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111010000110100
loss: 0.041358, lagrangian_loss: -0.000192, attention_score_distillation_loss: 0.000039
loss: 0.027932, lagrangian_loss: -0.000118, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 14:44:37
Evaluating: matthews_correlation: 0.594, eval_loss: 0.5913, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0631, expected_sparsity: 0.0626, expected_sequence_sparsity: 0.8313, target_sparsity: 0.0721, step: 2250
lambda_1: -0.1126, lambda_2: 22.0963, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.52]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111010000110100
loss: 0.027324, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.000036
loss: 0.826573, lagrangian_loss: -0.000072, attention_score_distillation_loss: 0.000039
----------------------------------------------------------------------
time: 2023-07-19 14:44:49
Evaluating: matthews_correlation: 0.588, eval_loss: 0.6145, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0689, expected_sequence_sparsity: 0.8325, target_sparsity: 0.0737, step: 2300
lambda_1: -0.0714, lambda_2: 22.0970, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.5 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111000000110100
loss: 0.018660, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000038
loss: 0.020474, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 14:45:01
Evaluating: matthews_correlation: 0.5732, eval_loss: 0.5948, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0689, expected_sequence_sparsity: 0.8325, target_sparsity: 0.0753, step: 2350
lambda_1: -0.0216, lambda_2: 22.0980, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.5 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111000000110100
loss: 0.106117, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000038
loss: 0.019980, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.000038
----------------------------------------------------------------------
time: 2023-07-19 14:45:14
Evaluating: matthews_correlation: 0.5836, eval_loss: 0.5961, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0689, expected_sequence_sparsity: 0.8325, target_sparsity: 0.077, step: 2400
lambda_1: -0.0307, lambda_2: 22.0985, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.49]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111000000110100
loss: 0.016105, lagrangian_loss: 0.000066, attention_score_distillation_loss: 0.000036
ETA: 1:41:12 | Epoch 8 finished. Took 69.86 seconds.
loss: 0.016177, lagrangian_loss: 0.000044, attention_score_distillation_loss: 0.000038
----------------------------------------------------------------------
time: 2023-07-19 14:45:26
Evaluating: matthews_correlation: 0.586, eval_loss: 0.5959, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0689, expected_sequence_sparsity: 0.8325, target_sparsity: 0.0786, step: 2450
lambda_1: -0.1256, lambda_2: 22.1005, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.48]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.45]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101111000000110100
loss: 0.082660, lagrangian_loss: 0.000135, attention_score_distillation_loss: 0.000037
loss: 0.008940, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000037
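The target_sparsity field ramps linearly toward the configured 0.43 over lagrangian_warmup_epochs=50. CoLA has 8,551 training sentences, so at batch size 32 an epoch is 268 steps and the warmup spans 13,400 steps; 0.43 * 2450 / 13400 is about 0.0786, reproducing the value logged at step 2450 above. A sketch of that schedule (the ceil-based step count is an assumption that happens to fit the logged numbers):

    import math

    TRAIN_EXAMPLES = 8551    # GLUE CoLA training set
    BATCH_SIZE = 32          # per_device_train_batch_size
    TARGET_SPARSITY = 0.43
    WARMUP_EPOCHS = 50       # lagrangian_warmup_epochs

    steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)  # 268
    warmup_steps = steps_per_epoch * WARMUP_EPOCHS            # 13400

    def target_at(step: int) -> float:
        # Linear warmup of the sparsity target, as the logged values suggest.
        return TARGET_SPARSITY * min(1.0, step / warmup_steps)

    print(round(target_at(2450), 4))  # 0.0786, matching the step-2450 evaluation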
----------------------------------------------------------------------
time: 2023-07-19 14:45:39
Evaluating: matthews_correlation: 0.5867, eval_loss: 0.5909, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0751, expected_sequence_sparsity: 0.8336, target_sparsity: 0.0802, step: 2500
lambda_1: -0.2101, lambda_2: 22.1021, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.46]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101110000000110100
loss: 0.029587, lagrangian_loss: -0.000044, attention_score_distillation_loss: 0.000037
loss: 0.029484, lagrangian_loss: -0.000263, attention_score_distillation_loss: 0.000037
----------------------------------------------------------------------
time: 2023-07-19 14:45:51
Evaluating: matthews_correlation: 0.5833, eval_loss: 0.6036, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0751, expected_sequence_sparsity: 0.8336, target_sparsity: 0.0818, step: 2550
lambda_1: -0.1345, lambda_2: 22.1045, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.43]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101110000000110100
loss: 0.315632, lagrangian_loss: -0.000152, attention_score_distillation_loss: 0.000036
loss: 0.007274, lagrangian_loss: -0.000027, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:46:03
Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6172, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0751, expected_sequence_sparsity: 0.8336, target_sparsity: 0.0834, step: 2600
lambda_1: -0.0131, lambda_2: 22.1076, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.43]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101110000000110100
loss: 0.055803, lagrangian_loss: 0.000015, attention_score_distillation_loss: 0.000035
loss: 0.021311, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:46:16
Evaluating: matthews_correlation: 0.5992, eval_loss: 0.5877, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0751, expected_sequence_sparsity: 0.8336, target_sparsity: 0.085, step: 2650
lambda_1: 0.0446, lambda_2: 22.1086, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.43]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101110000000110100
loss: 0.098597, lagrangian_loss: -0.000021, attention_score_distillation_loss: 0.000035
loss: 0.007134, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000035
ETA: 1:39:46 | Epoch 9 finished. Took 64.65 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:46:28
Evaluating: matthews_correlation: 0.5783, eval_loss: 0.6024, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.0751, expected_sequence_sparsity: 0.8336, target_sparsity: 0.0866, step: 2700
lambda_1: -0.0606, lambda_2: 22.1116, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.43]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.4]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01101110000000110100
loss: 0.022406, lagrangian_loss: 0.000240, attention_score_distillation_loss: 0.000036
loss: 0.023355, lagrangian_loss: 0.000445, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:46:41
Evaluating: matthews_correlation: 0.5833, eval_loss: 0.5985, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0883, expected_sparsity: 0.0814, expected_sequence_sparsity: 0.8347, target_sparsity: 0.0882, step: 2750
lambda_1: -0.2628, lambda_2: 22.1184, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.4 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100110000000110100
loss: 0.028770, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000035
loss: 0.978799, lagrangian_loss: -0.000313, attention_score_distillation_loss: 0.000034
----------------------------------------------------------------------
time: 2023-07-19 14:46:53
Evaluating: matthews_correlation: 0.586, eval_loss: 0.593, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0883, expected_sparsity: 0.0876, expected_sequence_sparsity: 0.8359, target_sparsity: 0.0898, step: 2800
lambda_1: -0.1367, lambda_2: 22.1242, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.35]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100110000000100100
loss: 0.033585, lagrangian_loss: -0.000210, attention_score_distillation_loss: 0.000035
loss: 0.008678, lagrangian_loss: 0.000096, attention_score_distillation_loss: 0.000034
----------------------------------------------------------------------
time: 2023-07-19 14:47:05
Evaluating: matthews_correlation: 0.5783, eval_loss: 0.5975, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0883, expected_sparsity: 0.0876, expected_sequence_sparsity: 0.8359, target_sparsity: 0.0914, step: 2850
lambda_1: 0.1540, lambda_2: 22.1381, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.37]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100110000000100100
loss: 0.027450, lagrangian_loss: -0.000082, attention_score_distillation_loss: 0.000035
loss: 0.058826, lagrangian_loss: -0.000158, attention_score_distillation_loss: 0.000036
----------------------------------------------------------------------
time: 2023-07-19 14:47:18
Evaluating: matthews_correlation: 0.5761, eval_loss: 0.5901, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0883, expected_sparsity: 0.0814, expected_sequence_sparsity: 0.8347, target_sparsity: 0.093, step: 2900
lambda_1: -0.0371, lambda_2: 22.1479, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.99 1. 0.38]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.35]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100110000000110100
loss: 0.027593, lagrangian_loss: 0.000227, attention_score_distillation_loss: 0.000037
loss: 0.023575, lagrangian_loss: 0.000386, attention_score_distillation_loss: 0.000036
ETA: 1:38:22 | Epoch 10 finished. Took 64.32 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:47:30
Evaluating: matthews_correlation: 0.5911, eval_loss: 0.5896, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0883, expected_sparsity: 0.0876, expected_sequence_sparsity: 0.8359, target_sparsity: 0.0946, step: 2950
lambda_1: -0.3392, lambda_2: 22.1620, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.35]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100110000000100100
loss: 0.013229, lagrangian_loss: 0.000218, attention_score_distillation_loss: 0.000036
loss: 0.035288, lagrangian_loss: 0.000067, attention_score_distillation_loss: 0.000033
----------------------------------------------------------------------
time: 2023-07-19 14:47:42
Evaluating: matthews_correlation: 0.5788, eval_loss: 0.5895, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.0939, expected_sequence_sparsity: 0.837, target_sparsity: 0.0962, step: 3000
lambda_1: -0.3052, lambda_2: 22.1653, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.31]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100010000000100100
loss: 0.203443, lagrangian_loss: -0.000790, attention_score_distillation_loss: 0.000035
loss: 0.063466, lagrangian_loss: -0.000089, attention_score_distillation_loss: 0.000035
----------------------------------------------------------------------
time: 2023-07-19 14:47:55
Evaluating: matthews_correlation: 0.5858, eval_loss: 0.5977, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.0939, expected_sequence_sparsity: 0.837, target_sparsity: 0.0978, step: 3050
lambda_1: 0.1106, lambda_2: 22.1901, lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1.
0.3 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100010000000100100 loss: 0.183510, lagrangian_loss: 0.000345, attention_score_distillation_loss: 0.000035 loss: 0.003043, lagrangian_loss: -0.000178, attention_score_distillation_loss: 0.000035 ---------------------------------------------------------------------- time: 2023-07-19 14:48:07 Evaluating: matthews_correlation: 0.5855, eval_loss: 0.6007, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.0939, expected_sequence_sparsity: 0.837, target_sparsity: 0.0994, step: 3100 lambda_1: 0.1311, lambda_2: 22.1967 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.32] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100010000000100100 loss: 0.017544, lagrangian_loss: -0.000055, attention_score_distillation_loss: 0.000037 loss: 0.037259, lagrangian_loss: 0.000469, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:48:20 Evaluating: matthews_correlation: 0.5806, eval_loss: 0.6054, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.0939, expected_sequence_sparsity: 0.837, target_sparsity: 0.101, step: 3150 lambda_1: -0.2223, lambda_2: 22.2140 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.31] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.25] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100010000000100100 loss: 0.095270, lagrangian_loss: 0.001061, attention_score_distillation_loss: 0.000032 loss: 0.031699, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:48:32 Evaluating: matthews_correlation: 0.5758, eval_loss: 0.593, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.1001, expected_sequence_sparsity: 0.8382, target_sparsity: 0.1026, step: 3200 lambda_1: -0.3491, lambda_2: 22.2230 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.26] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100100 loss: 0.048212, lagrangian_loss: -0.000597, attention_score_distillation_loss: 0.000033 ETA: 1:37:40 | Epoch 11 finished. Took 69.68 seconds. 
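The target_sparsity column above advances by a fixed 0.0016 every 50 steps (3.2e-5 per step), which is consistent with a linear Lagrangian warmup toward the configured target of 0.43 spread over 50 epochs of roughly 268 steps each (CoLA's 8,551 training sentences at batch size 32). A minimal sketch of that schedule, assuming it is the usual linear warmup; the function name and arguments here are illustrative, not this repo's exact code:

def linear_target_sparsity(step, final_sparsity=0.43,
                           steps_per_epoch=268, warmup_epochs=50):
    # Anneal the sparsity target from 0 to `final_sparsity` across the
    # Lagrangian warmup window, then hold it constant.
    warmup_steps = steps_per_epoch * warmup_epochs  # ~13,400 steps here
    return min(final_sparsity, final_sparsity * step / warmup_steps)

# Sanity check against the log: step 2500 -> 0.0802, as printed above.
print(round(linear_target_sparsity(2500), 4))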
loss: 0.023820, lagrangian_loss: -0.000387, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:48:44 Evaluating: matthews_correlation: 0.5781, eval_loss: 0.6075, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.1001, expected_sequence_sparsity: 0.8382, target_sparsity: 0.1042, step: 3250 lambda_1: -0.0354, lambda_2: 22.2363 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.25] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100100 loss: 0.041774, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000033 loss: 0.015969, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000036 ---------------------------------------------------------------------- time: 2023-07-19 14:48:57 Evaluating: matthews_correlation: 0.5858, eval_loss: 0.6091, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.1001, expected_sequence_sparsity: 0.8382, target_sparsity: 0.1058, step: 3300 lambda_1: 0.0847, lambda_2: 22.2426 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.26] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100100 loss: 0.009516, lagrangian_loss: -0.000068, attention_score_distillation_loss: 0.000034 loss: 0.012627, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.000035 ---------------------------------------------------------------------- time: 2023-07-19 14:49:09 Evaluating: matthews_correlation: 0.5753, eval_loss: 0.6047, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.1001, expected_sequence_sparsity: 0.8382, target_sparsity: 0.1074, step: 3350 lambda_1: -0.1373, lambda_2: 22.2499 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 0.25] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100100 loss: 0.008312, lagrangian_loss: 0.000212, attention_score_distillation_loss: 0.000035 loss: 0.045754, lagrangian_loss: 0.000225, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:49:21 Evaluating: matthews_correlation: 0.5965, eval_loss: 0.5817, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.101, expected_sparsity: 0.1001, expected_sequence_sparsity: 0.8382, target_sparsity: 0.109, step: 3400 lambda_1: -0.3418, lambda_2: 22.2564 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.98 1. 0.22] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100100 loss: 0.019644, lagrangian_loss: -0.000141, attention_score_distillation_loss: 0.000034 loss: 0.012050, lagrangian_loss: -0.000482, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:49:34 Evaluating: matthews_correlation: 0.5935, eval_loss: 0.5955, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.1136, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.8393, target_sparsity: 0.1106, step: 3450 lambda_1: -0.1880, lambda_2: 22.2621 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.2 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100000 loss: 1.036332, lagrangian_loss: -0.000291, attention_score_distillation_loss: 0.000033 loss: 0.005280, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000034 ETA: 1:36:18 | Epoch 12 finished. Took 64.32 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:49:46 Evaluating: matthews_correlation: 0.594, eval_loss: 0.5788, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.1136, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.8393, target_sparsity: 0.1122, step: 3500 lambda_1: 0.0587, lambda_2: 22.2715 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 0.99 0.2 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100000 loss: 0.016748, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000034 loss: 0.016503, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.000033 ---------------------------------------------------------------------- time: 2023-07-19 14:49:59 Evaluating: matthews_correlation: 0.5936, eval_loss: 0.5855, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.1136, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.8393, target_sparsity: 0.1139, step: 3550 lambda_1: -0.0812, lambda_2: 22.2778 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.98 1. 
0.21] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100000 loss: 0.078844, lagrangian_loss: 0.000422, attention_score_distillation_loss: 0.000032 loss: 0.030926, lagrangian_loss: 0.000447, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:50:11 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.5846, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.1136, expected_sparsity: 0.1064, expected_sequence_sparsity: 0.8393, target_sparsity: 0.1155, step: 3600 lambda_1: -0.4326, lambda_2: 22.2918 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.99 0.19] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.15] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 01100000000000100000 loss: 0.075252, lagrangian_loss: 0.000284, attention_score_distillation_loss: 0.000033 loss: 0.024099, lagrangian_loss: -0.000676, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:50:24 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.5809, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1149, expected_sequence_sparsity: 0.8409, target_sparsity: 0.1171, step: 3650 lambda_1: -0.3939, lambda_2: 22.2953 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.99 0.17] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.15] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.14] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 01100000000000100000 loss: 0.027547, lagrangian_loss: -0.000537, attention_score_distillation_loss: 0.000033 loss: 0.021576, lagrangian_loss: -0.000448, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:50:36 Evaluating: matthews_correlation: 0.6043, eval_loss: 0.5773, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1187, step: 3700 lambda_1: -0.0768, lambda_2: 22.3066 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.97 0.99 0.15] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.025838, lagrangian_loss: -0.000051, attention_score_distillation_loss: 0.000032 loss: 0.012543, lagrangian_loss: 0.000088, attention_score_distillation_loss: 0.000034 ---------------------------------------------------------------------- time: 2023-07-19 14:50:48 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.5821, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1203, step: 3750 lambda_1: 0.0835, lambda_2: 22.3115 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.99 0.15] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.009182, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000033 ETA: 1:35:36 | Epoch 13 finished. Took 70.27 seconds. loss: 0.010844, lagrangian_loss: 0.000003, attention_score_distillation_loss: 0.000033 ---------------------------------------------------------------------- time: 2023-07-19 14:51:01 Evaluating: matthews_correlation: 0.591, eval_loss: 0.5859, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1219, step: 3800 lambda_1: -0.1335, lambda_2: 22.3200 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.97 0.99 0.16] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.412542, lagrangian_loss: 0.000378, attention_score_distillation_loss: 0.000032 loss: 0.018157, lagrangian_loss: 0.001308, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:51:13 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.5894, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1235, step: 3850 lambda_1: -0.5598, lambda_2: 22.3384 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.97 0.99 0.14] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.124033, lagrangian_loss: 0.000632, attention_score_distillation_loss: 0.000032 loss: 0.020864, lagrangian_loss: -0.000210, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:51:26 Evaluating: matthews_correlation: 0.5858, eval_loss: 0.5849, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1251, step: 3900 lambda_1: -0.6381, lambda_2: 22.3424 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.12] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.005461, lagrangian_loss: 0.000558, attention_score_distillation_loss: 0.000030 loss: 0.012050, lagrangian_loss: -0.001116, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:51:38 Evaluating: matthews_correlation: 0.5967, eval_loss: 0.5731, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1267, step: 3950 lambda_1: -0.3776, lambda_2: 22.3517 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.11] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.008747, lagrangian_loss: -0.000572, attention_score_distillation_loss: 0.000031 loss: 0.040760, lagrangian_loss: -0.000146, attention_score_distillation_loss: 0.000031 ---------------------------------------------------------------------- time: 2023-07-19 14:51:51 Evaluating: matthews_correlation: 0.5889, eval_loss: 0.5881, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1283, step: 4000 lambda_1: -0.0766, lambda_2: 22.3623 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.1 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.004737, lagrangian_loss: -0.000064, attention_score_distillation_loss: 0.000032 ETA: 1:34:18 | Epoch 14 finished. Took 64.72 seconds. 
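The lagrangian_loss values swinging between small positive and negative numbers, with lambda_1 oscillating around zero while lambda_2 grows steadily (22.1021 at step 2500, 22.3623 by step 4000), match the standard L0-regularization penalty in which the multipliers are trained by gradient ascent against the model (the reg_learning_rate=0.02 in the arguments above would drive that update). A hedged sketch of that penalty, using the common formulation rather than this repo's verified code:

import torch

def lagrangian_penalty(expected_sparsity: torch.Tensor,
                       target_sparsity: float,
                       lambda_1: torch.Tensor,
                       lambda_2: torch.Tensor) -> torch.Tensor:
    # First- plus second-order penalty on the gap between the model's
    # expected sparsity and the scheduled target. The multipliers are
    # updated to *maximize* this term, so the logged value can be
    # negative whenever the gap and the learned lambda_1 disagree in sign.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap

# Shape/sign illustration with scalars like the logged ones (the logged
# loss uses per-batch expected sparsity, so exact values won't reproduce):
print(lagrangian_penalty(torch.tensor(0.135), 0.128,
                         torch.tensor(-0.08), torch.tensor(22.36)))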
loss: 0.014773, lagrangian_loss: -0.000012, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:52:03 Evaluating: matthews_correlation: 0.5758, eval_loss: 0.5977, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1299, step: 4050 lambda_1: -0.1027, lambda_2: 22.3660 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.1 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.015971, lagrangian_loss: 0.000233, attention_score_distillation_loss: 0.000030 loss: 0.014688, lagrangian_loss: 0.000446, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 14:52:15 Evaluating: matthews_correlation: 0.573, eval_loss: 0.5924, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1287, expected_sparsity: 0.1208, expected_sequence_sparsity: 0.8419, target_sparsity: 0.1315, step: 4100 lambda_1: -0.3576, lambda_2: 22.3739 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.96 0.99 0.1 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.1] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00100000000000100000 loss: 0.010542, lagrangian_loss: 0.000652, attention_score_distillation_loss: 0.000031 loss: 0.011848, lagrangian_loss: -0.000071, attention_score_distillation_loss: 0.000032 ---------------------------------------------------------------------- time: 2023-07-19 14:52:28 Evaluating: matthews_correlation: 0.5626, eval_loss: 0.5947, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1267, expected_sequence_sparsity: 0.843, target_sparsity: 0.1331, step: 4150 lambda_1: -0.5693, lambda_2: 22.3797 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.95 0.99 0.09] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00000000000000100000 loss: 0.048168, lagrangian_loss: 0.000115, attention_score_distillation_loss: 0.000032 loss: 0.570838, lagrangian_loss: -0.000162, attention_score_distillation_loss: 0.000031 ---------------------------------------------------------------------- time: 2023-07-19 14:52:40 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.5913, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1267, expected_sequence_sparsity: 0.843, target_sparsity: 0.1347, step: 4200 lambda_1: -0.6470, lambda_2: 22.3830 lambda_3: 0.0000 train remain: [0.99 1. 1. 
1. 1. 1. 1. 0.95 0.98 0.07] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00000000000000100000 loss: 0.010227, lagrangian_loss: -0.000559, attention_score_distillation_loss: 0.000031 loss: 0.016885, lagrangian_loss: -0.000750, attention_score_distillation_loss: 0.000031 ---------------------------------------------------------------------- time: 2023-07-19 14:52:53 Evaluating: matthews_correlation: 0.5908, eval_loss: 0.5886, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1267, expected_sequence_sparsity: 0.843, target_sparsity: 0.1363, step: 4250 lambda_1: -0.5089, lambda_2: 22.3872 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.94 0.98 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00000000000000100000 loss: 0.117484, lagrangian_loss: -0.000578, attention_score_distillation_loss: 0.000030 loss: 0.014682, lagrangian_loss: -0.000055, attention_score_distillation_loss: 0.000030 ETA: 1:33:01 | Epoch 15 finished. Took 64.67 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:53:05 Evaluating: matthews_correlation: 0.5858, eval_loss: 0.5894, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1267, expected_sequence_sparsity: 0.843, target_sparsity: 0.1379, step: 4300 lambda_1: -0.2740, lambda_2: 22.3941 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.94 0.97 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00000000000000100000 loss: 0.026455, lagrangian_loss: -0.000448, attention_score_distillation_loss: 0.000031 loss: 0.013203, lagrangian_loss: -0.000039, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 14:53:18 Evaluating: matthews_correlation: 0.5914, eval_loss: 0.5695, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1267, expected_sequence_sparsity: 0.843, target_sparsity: 0.1395, step: 4350 lambda_1: -0.0852, lambda_2: 22.4013 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.94 0.96 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 00000000000000100000 loss: 0.014955, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000031 loss: 0.007643, lagrangian_loss: 0.000165, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 14:53:30 Evaluating: matthews_correlation: 0.602, eval_loss: 0.5733, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1346, expected_sequence_sparsity: 0.8445, target_sparsity: 0.1411, step: 4400 lambda_1: -0.2008, lambda_2: 22.4080 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.93 0.97 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111111 00000000000000100000 loss: 0.056004, lagrangian_loss: 0.000214, attention_score_distillation_loss: 0.000030 loss: 0.119878, lagrangian_loss: 0.000055, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 14:53:42 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.589, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1346, expected_sequence_sparsity: 0.8445, target_sparsity: 0.1427, step: 4450 lambda_1: -0.4311, lambda_2: 22.4146 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.92 0.97 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111111 00000000000000100000 loss: 0.020398, lagrangian_loss: 0.000288, attention_score_distillation_loss: 0.000030 loss: 0.012248, lagrangian_loss: 0.000095, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:53:55 Evaluating: matthews_correlation: 0.5833, eval_loss: 0.5985, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1346, expected_sequence_sparsity: 0.8445, target_sparsity: 0.1443, step: 4500 lambda_1: -0.5397, lambda_2: 22.4193 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.91 0.95 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111111 00000000000000100000 loss: 0.099217, lagrangian_loss: -0.000194, attention_score_distillation_loss: 0.000030 loss: 0.043075, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:54:07 Evaluating: matthews_correlation: 0.5916, eval_loss: 0.5938, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1346, expected_sequence_sparsity: 0.8445, target_sparsity: 0.1459, step: 4550 lambda_1: -0.3276, lambda_2: 22.4253 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.91 0.94 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111111 00000000000000100000 loss: 0.024051, lagrangian_loss: -0.000425, attention_score_distillation_loss: 0.000030 ETA: 1:32:12 | Epoch 16 finished. Took 70.05 seconds. loss: 0.019365, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:54:20 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.6039, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1456, expected_sequence_sparsity: 0.8465, target_sparsity: 0.1475, step: 4600 lambda_1: -0.1316, lambda_2: 22.4317 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.91 0.93 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 00000000000000100000 loss: 0.015307, lagrangian_loss: -0.000100, attention_score_distillation_loss: 0.000030 loss: 0.008962, lagrangian_loss: 0.000146, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 14:54:32 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.5972, token_prune_loc: [False, False, False, False, False, False, False, True, False, True], macs_sparsity: 0.1413, expected_sparsity: 0.1346, expected_sequence_sparsity: 0.8445, target_sparsity: 0.1492, step: 4650 lambda_1: -0.3024, lambda_2: 22.4404 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 
0.91 0.93 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111111 00000000000000100000 loss: 0.036502, lagrangian_loss: 0.001332, attention_score_distillation_loss: 0.000029 loss: 0.481697, lagrangian_loss: 0.000634, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:54:44 Evaluating: matthews_correlation: 0.5863, eval_loss: 0.592, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1456, expected_sequence_sparsity: 0.8465, target_sparsity: 0.1508, step: 4700 lambda_1: -0.7250, lambda_2: 22.4578 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.9 0.9 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 00000000000000000001 loss: 0.093523, lagrangian_loss: 0.000921, attention_score_distillation_loss: 0.000029 loss: 0.005480, lagrangian_loss: -0.000381, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:54:57 Evaluating: matthews_correlation: 0.5865, eval_loss: 0.6104, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1456, expected_sequence_sparsity: 0.8465, target_sparsity: 0.1524, step: 4750 lambda_1: -0.8571, lambda_2: 22.4639 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.9 0.87 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 00000000000000000001 loss: 0.034784, lagrangian_loss: 0.000083, attention_score_distillation_loss: 0.000029 loss: 0.005347, lagrangian_loss: 0.001107, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:55:09 Evaluating: matthews_correlation: 0.5865, eval_loss: 0.593, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1456, expected_sequence_sparsity: 0.8465, target_sparsity: 0.154, step: 4800 lambda_1: -0.8326, lambda_2: 22.4694 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.9 0.86 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 00100000000000000000 loss: 0.016546, lagrangian_loss: 0.001832, attention_score_distillation_loss: 0.000027 ETA: 1:30:56 | Epoch 17 finished. Took 64.66 seconds. 
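Each 20-character row printed with an evaluation block is a token-bin keep mask for one prunable layer (prune_location 2 through 11), and the "infer remain" entries are simply the fraction of 1s in each row; "layerwise remain" then carries the running product, since a token dropped at one layer never reaches any later layer. A small reconstruction from the step-4600 block above (illustrative, not the repo's code):

import numpy as np

# Last three prunable layers from the step-4600 evaluation block.
mask_rows = [
    "11111111111111111100",  # 18/20 bins kept -> 0.90
    "11111111111111011100",  # 17/20 bins kept -> 0.85
    "00000000000000100000",  #  1/20 bins kept -> 0.05
]
infer_remain = [row.count("1") / len(row) for row in mask_rows]

# Cumulative product over depth reproduces the tail of "layerwise remain".
layerwise_remain = np.cumprod(infer_remain)

print(infer_remain)               # [0.9, 0.85, 0.05]
print(layerwise_remain.round(2))  # [0.9  0.76 0.04] -- matches the log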
loss: 0.021696, lagrangian_loss: 0.001651, attention_score_distillation_loss: 0.000028 ---------------------------------------------------------------------- time: 2023-07-19 14:55:22 Evaluating: matthews_correlation: 0.5962, eval_loss: 0.5829, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1456, expected_sequence_sparsity: 0.8465, target_sparsity: 0.1556, step: 4850 lambda_1: -1.0078, lambda_2: 22.4757 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.9 0.85 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 00000000000000000001 loss: 0.718562, lagrangian_loss: 0.000608, attention_score_distillation_loss: 0.000028 loss: 0.045760, lagrangian_loss: -0.000431, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:55:34 Evaluating: matthews_correlation: 0.5841, eval_loss: 0.6002, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1493, expected_sequence_sparsity: 0.8471, target_sparsity: 0.1572, step: 4900 lambda_1: -0.7682, lambda_2: 22.4900 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 1. 1. 0.9 0.81 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111011100 00000000000001000000 loss: 0.021543, lagrangian_loss: -0.001404, attention_score_distillation_loss: 0.000028 loss: 0.004239, lagrangian_loss: -0.000569, attention_score_distillation_loss: 0.000028 ---------------------------------------------------------------------- time: 2023-07-19 14:55:47 Evaluating: matthews_correlation: 0.592, eval_loss: 0.5934, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1493, expected_sequence_sparsity: 0.8471, target_sparsity: 0.1588, step: 4950 lambda_1: -0.3426, lambda_2: 22.5045 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.81 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111011100 00000000000000000001 loss: 0.585421, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.000028 loss: 0.031455, lagrangian_loss: -0.000256, attention_score_distillation_loss: 0.000029 ---------------------------------------------------------------------- time: 2023-07-19 14:55:59 Evaluating: matthews_correlation: 0.5936, eval_loss: 0.5982, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1493, expected_sequence_sparsity: 0.8471, target_sparsity: 0.1604, step: 5000 lambda_1: -0.2333, lambda_2: 22.5118 lambda_3: 0.0000 train remain: [0.99 1. 1. 
1. 1. 0.99 0.99 0.9 0.8 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111011100 00000100000000000000 loss: 0.008938, lagrangian_loss: 0.000632, attention_score_distillation_loss: 0.000027 loss: 0.021823, lagrangian_loss: 0.000997, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:56:12 Evaluating: matthews_correlation: 0.5812, eval_loss: 0.6086, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1565, expected_sparsity: 0.1493, expected_sequence_sparsity: 0.8471, target_sparsity: 0.162, step: 5050 lambda_1: -0.6176, lambda_2: 22.5244 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.8 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.04] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111011100 00000000000000000001 loss: 0.029746, lagrangian_loss: 0.001429, attention_score_distillation_loss: 0.000028 loss: 0.025346, lagrangian_loss: 0.000932, attention_score_distillation_loss: 0.000028 ETA: 1:29:41 | Epoch 18 finished. Took 64.57 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:56:24 Evaluating: matthews_correlation: 0.5836, eval_loss: 0.5949, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1529, expected_sequence_sparsity: 0.8478, target_sparsity: 0.1636, step: 5100 lambda_1: -1.0965, lambda_2: 22.5411 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.77 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.75, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.68, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111001100 00000000000000000001 loss: 0.046629, lagrangian_loss: -0.000615, attention_score_distillation_loss: 0.000028 loss: 0.014266, lagrangian_loss: 0.000631, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 14:56:36 Evaluating: matthews_correlation: 0.5839, eval_loss: 0.6034, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1529, expected_sequence_sparsity: 0.8478, target_sparsity: 0.1652, step: 5150 lambda_1: -1.0905, lambda_2: 22.5491 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 
0.99 0.99 0.9 0.73 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.75, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.68, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111110111001100 10000000000000000000 loss: 0.010103, lagrangian_loss: -0.001868, attention_score_distillation_loss: 0.000028 loss: 0.037788, lagrangian_loss: -0.001724, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:56:49 Evaluating: matthews_correlation: 0.5774, eval_loss: 0.5942, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1566, expected_sequence_sparsity: 0.8485, target_sparsity: 0.1668, step: 5200 lambda_1: -0.2889, lambda_2: 22.5960 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.71 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.63, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111110111001100 00000000100000000000 loss: 0.028017, lagrangian_loss: -0.000814, attention_score_distillation_loss: 0.000028 loss: 0.015965, lagrangian_loss: 0.000656, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:57:01 Evaluating: matthews_correlation: 0.5874, eval_loss: 0.5963, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1566, expected_sequence_sparsity: 0.8485, target_sparsity: 0.1684, step: 5250 lambda_1: 0.5723, lambda_2: 22.6506 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.71 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.63, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111110111001100 00000000000010000000 loss: 0.018675, lagrangian_loss: -0.000544, attention_score_distillation_loss: 0.000026 loss: 0.018954, lagrangian_loss: -0.001111, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:57:14 Evaluating: matthews_correlation: 0.5843, eval_loss: 0.6071, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1566, expected_sequence_sparsity: 0.8485, target_sparsity: 0.17, step: 5300 lambda_1: -0.1986, lambda_2: 22.7167 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.99 0.99 0.9 0.73 0.07] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.63, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111110111001100 00000000000000000001 loss: 0.004711, lagrangian_loss: 0.002003, attention_score_distillation_loss: 0.000027 loss: 0.017301, lagrangian_loss: 0.004063, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:57:26 Evaluating: matthews_correlation: 0.5946, eval_loss: 0.5992, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1641, expected_sparsity: 0.1566, expected_sequence_sparsity: 0.8485, target_sparsity: 0.1716, step: 5350 lambda_1: -1.4901, lambda_2: 22.8375 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 0.99 0.9 0.69 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.63, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111110111001100 00000000010000000000 loss: 0.021910, lagrangian_loss: 0.003021, attention_score_distillation_loss: 0.000027 ETA: 1:28:50 | Epoch 19 finished. Took 70.29 seconds. loss: 0.025565, lagrangian_loss: -0.003471, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:57:39 Evaluating: matthews_correlation: 0.5895, eval_loss: 0.5935, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1603, expected_sequence_sparsity: 0.8491, target_sparsity: 0.1732, step: 5400 lambda_1: -1.5134, lambda_2: 22.8594 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.98 0.89 0.64 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.65, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.59, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111010111001100 00000000000000000001 loss: 0.272059, lagrangian_loss: -0.006515, attention_score_distillation_loss: 0.000028 loss: 0.075461, lagrangian_loss: -0.004524, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:57:51 Evaluating: matthews_correlation: 0.5892, eval_loss: 0.6012, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1639, expected_sequence_sparsity: 0.8498, target_sparsity: 0.1748, step: 5450 lambda_1: -0.1104, lambda_2: 23.0135 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.98 0.98 0.89 0.61 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010111001100 00000000000000100000 loss: 0.014819, lagrangian_loss: 0.000105, attention_score_distillation_loss: 0.000026 loss: 0.020864, lagrangian_loss: 0.003519, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:58:04 Evaluating: matthews_correlation: 0.582, eval_loss: 0.6131, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1639, expected_sequence_sparsity: 0.8498, target_sparsity: 0.1764, step: 5500 lambda_1: 1.1297, lambda_2: 23.1567 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 0.99 0.9 0.63 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010111001100 00000000000000000001 loss: 0.037544, lagrangian_loss: -0.002247, attention_score_distillation_loss: 0.000026 loss: 0.431902, lagrangian_loss: -0.004408, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 14:58:16 Evaluating: matthews_correlation: 0.5879, eval_loss: 0.5996, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1566, expected_sequence_sparsity: 0.8485, target_sparsity: 0.178, step: 5550 lambda_1: -0.7108, lambda_2: 23.6092 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 0.99 0.9 0.65 0.08] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.65, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.59, 0.06] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110111010111001100 00000000000000000011 loss: 0.013381, lagrangian_loss: 0.009721, attention_score_distillation_loss: 0.000026 loss: 0.030895, lagrangian_loss: 0.003778, attention_score_distillation_loss: 0.000027 ---------------------------------------------------------------------- time: 2023-07-19 14:58:29 Evaluating: matthews_correlation: 0.5786, eval_loss: 0.6237, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1639, expected_sequence_sparsity: 0.8498, target_sparsity: 0.1796, step: 5600 lambda_1: -2.0978, lambda_2: 23.9368 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.99 0.98 0.89 0.59 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010111001100 10000000000000000000 loss: 0.005120, lagrangian_loss: 0.000907, attention_score_distillation_loss: 0.000025 loss: 0.015775, lagrangian_loss: -0.003264, attention_score_distillation_loss: 0.000026 ETA: 1:27:38 | Epoch 20 finished. Took 64.99 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:58:41 Evaluating: matthews_correlation: 0.5717, eval_loss: 0.6302, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1944, expected_sparsity: 0.1859, expected_sequence_sparsity: 0.8538, target_sparsity: 0.1812, step: 5650 lambda_1: -1.4996, lambda_2: 24.0207 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.98 0.96 0.89 0.57 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.45, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111100 11110110010110001100 00000000000000000001 loss: 0.014364, lagrangian_loss: -0.007478, attention_score_distillation_loss: 0.000025 loss: 0.013550, lagrangian_loss: -0.003740, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 14:58:54 Evaluating: matthews_correlation: 0.5715, eval_loss: 0.6287, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1944, expected_sparsity: 0.1859, expected_sequence_sparsity: 0.8538, target_sparsity: 0.1828, step: 5700 lambda_1: 0.5183, lambda_2: 24.5603 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.98 0.95 0.88 0.57 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.45, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111100 11110110010110001100 00000000000000010000 loss: 0.022355, lagrangian_loss: 0.008738, attention_score_distillation_loss: 0.000027 loss: 0.034574, lagrangian_loss: 0.001278, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 14:59:06 Evaluating: matthews_correlation: 0.5797, eval_loss: 0.6219, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1605, expected_sequence_sparsity: 0.8492, target_sparsity: 0.1845, step: 5750 lambda_1: 1.2857, lambda_2: 24.8398 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 
0.99 0.98 0.9 0.6 0.09] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010111001100 00000000001000000001 loss: 0.003873, lagrangian_loss: -0.011571, attention_score_distillation_loss: 0.000024 loss: 0.007203, lagrangian_loss: 0.003571, attention_score_distillation_loss: 0.000026 ---------------------------------------------------------------------- time: 2023-07-19 14:59:18 Evaluating: matthews_correlation: 0.5617, eval_loss: 0.6339, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1639, expected_sequence_sparsity: 0.8498, target_sparsity: 0.1861, step: 5800 lambda_1: -1.2707, lambda_2: 26.1359 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.9 0.59 0.07] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.03] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010111001100 00000000000000000001 loss: 0.011294, lagrangian_loss: 0.015263, attention_score_distillation_loss: 0.000026 loss: 0.052171, lagrangian_loss: 0.010876, attention_score_distillation_loss: 0.000025 ---------------------------------------------------------------------- time: 2023-07-19 14:59:31 Evaluating: matthews_correlation: 0.5771, eval_loss: 0.6168, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1676, expected_sequence_sparsity: 0.8505, target_sparsity: 0.1877, step: 5850 lambda_1: -2.4409, lambda_2: 26.5250 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.98 0.98 0.88 0.56 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.5, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010110001100 00000100000000000000 loss: 0.017062, lagrangian_loss: -0.000978, attention_score_distillation_loss: 0.000025 loss: 0.010313, lagrangian_loss: -0.012541, attention_score_distillation_loss: 0.000025 ETA: 1:26:24 | Epoch 21 finished. Took 64.64 seconds. ---------------------------------------------------------------------- time: 2023-07-19 14:59:43 Evaluating: matthews_correlation: 0.572, eval_loss: 0.6249, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.202, expected_sparsity: 0.1862, expected_sequence_sparsity: 0.8539, target_sparsity: 0.1893, step: 5900 lambda_1: -1.7888, lambda_2: 26.7386 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 
0.97 0.95 0.85 0.53 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 1.0, 0.85, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.81, 0.44, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111111 11111111111111011100 11110110010110001100 00000000000000000001 loss: 0.203591, lagrangian_loss: -0.013813, attention_score_distillation_loss: 0.000024 loss: 0.010098, lagrangian_loss: -0.004911, attention_score_distillation_loss: 0.000025 ---------------------------------------------------------------------- time: 2023-07-19 14:59:56 Evaluating: matthews_correlation: 0.5851, eval_loss: 0.6191, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.202, expected_sparsity: 0.1913, expected_sequence_sparsity: 0.8548, target_sparsity: 0.1909, step: 5950 lambda_1: 0.3627, lambda_2: 27.8935 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.97 0.95 0.84 0.53 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.85, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.76, 0.42, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111011100 11110110010110001100 00000000000000010000 loss: 0.125223, lagrangian_loss: 0.006789, attention_score_distillation_loss: 0.000025 loss: 0.037516, lagrangian_loss: 0.002859, attention_score_distillation_loss: 0.000023 ---------------------------------------------------------------------- time: 2023-07-19 15:00:08 Evaluating: matthews_correlation: 0.5927, eval_loss: 0.6107, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1868, expected_sparsity: 0.1707, expected_sequence_sparsity: 0.851, target_sparsity: 0.1925, step: 6000 lambda_1: 1.3603, lambda_2: 28.3284 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.98 0.97 0.87 0.56 0.08] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.55, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.47, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111011100 11110110010110001100 00000000000000001001 loss: 0.011271, lagrangian_loss: -0.008582, attention_score_distillation_loss: 0.000024 loss: 0.017158, lagrangian_loss: -0.004805, attention_score_distillation_loss: 0.000024 ---------------------------------------------------------------------- time: 2023-07-19 15:00:21 Evaluating: matthews_correlation: 0.5825, eval_loss: 0.6188, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1717, expected_sparsity: 0.1605, expected_sequence_sparsity: 0.8492, target_sparsity: 0.1941, step: 6050 lambda_1: -0.3824, lambda_2: 29.4481 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.99 0.97 0.88 0.58 0.09] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.6, 0.1] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.54, 0.05] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11110110010110001101 00000000000000001001 loss: 0.027891, lagrangian_loss: 0.009296, attention_score_distillation_loss: 0.000025 loss: 0.438040, lagrangian_loss: 0.009376, attention_score_distillation_loss: 0.000023 ---------------------------------------------------------------------- time: 2023-07-19 15:00:33 Evaluating: matthews_correlation: 0.5879, eval_loss: 0.602, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1868, expected_sparsity: 0.1736, expected_sequence_sparsity: 0.8516, target_sparsity: 0.1957, step: 6100 lambda_1: -1.6030, lambda_2: 30.1202 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.96 0.85 0.53 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.47, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111011100 11110110010110001100 00000000000000001000 loss: 0.026147, lagrangian_loss: 0.001642, attention_score_distillation_loss: 0.000024 loss: 0.515038, lagrangian_loss: -0.007879, attention_score_distillation_loss: 0.000025 ---------------------------------------------------------------------- time: 2023-07-19 15:00:45 Evaluating: matthews_correlation: 0.5825, eval_loss: 0.617, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.1973, step: 6150 lambda_1: -1.1508, lambda_2: 30.2563 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 0.94 0.83 0.52 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000000001 loss: 0.046056, lagrangian_loss: -0.005879, attention_score_distillation_loss: 0.000024 ETA: 1:25:30 | Epoch 22 finished. Took 70.22 seconds. loss: 0.211151, lagrangian_loss: -0.003105, attention_score_distillation_loss: 0.000025 ---------------------------------------------------------------------- time: 2023-07-19 15:00:58 Evaluating: matthews_correlation: 0.5828, eval_loss: 0.6092, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.1989, step: 6200 lambda_1: -0.0944, lambda_2: 30.6436 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
0.97 0.94 0.82 0.52 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000100000 loss: 0.319992, lagrangian_loss: 0.000234, attention_score_distillation_loss: 0.000024 loss: 0.025576, lagrangian_loss: 0.001695, attention_score_distillation_loss: 0.000024 ---------------------------------------------------------------------- time: 2023-07-19 15:01:10 Evaluating: matthews_correlation: 0.5933, eval_loss: 0.5987, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1868, expected_sparsity: 0.1771, expected_sequence_sparsity: 0.8522, target_sparsity: 0.2005, step: 6250 lambda_1: 0.5509, lambda_2: 30.8208 lambda_3: 0.0000 train remain: [0.99 1. 1. 1. 1. 0.98 0.95 0.83 0.52 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.42, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111011100 11110110010110001000 00000000000001000000 loss: 0.016043, lagrangian_loss: -0.000084, attention_score_distillation_loss: 0.000024 loss: 0.016756, lagrangian_loss: -0.001612, attention_score_distillation_loss: 0.000024 ---------------------------------------------------------------------- time: 2023-07-19 15:01:23 Evaluating: matthews_correlation: 0.5717, eval_loss: 0.6127, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1868, expected_sparsity: 0.1736, expected_sequence_sparsity: 0.8516, target_sparsity: 0.2021, step: 6300 lambda_1: 0.1022, lambda_2: 30.9545 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.96 0.84 0.53 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.55, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.47, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111011100 11110110010110001100 00000000000000000001 loss: 0.291651, lagrangian_loss: 0.001141, attention_score_distillation_loss: 0.000023 loss: 0.142772, lagrangian_loss: 0.003911, attention_score_distillation_loss: 0.000024 ---------------------------------------------------------------------- time: 2023-07-19 15:01:35 Evaluating: matthews_correlation: 0.5959, eval_loss: 0.595, token_prune_loc: [False, False, False, False, False, False, False, True, True, True], macs_sparsity: 0.1868, expected_sparsity: 0.1829, expected_sequence_sparsity: 0.8533, target_sparsity: 0.2037, step: 6350 lambda_1: -0.9785, lambda_2: 31.3738 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 
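0.98 0.95 0.82 0.52 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.4, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111011111011100 11110110010110001000 00001000000000000000 loss: 0.040490, lagrangian_loss: 0.005482, attention_score_distillation_loss: 0.000024 loss: 0.007750, lagrangian_loss: 0.004256, attention_score_distillation_loss: 0.000023

Note: the lagrangian_loss column oscillates around zero because the sparsity constraint is enforced through a Lagrangian relaxation of the gap between expected and target sparsity, with lambda_1 and lambda_2 trained adversarially (which is why lambda_1 repeatedly changes sign across the evaluations above). A minimal sketch of the usual two-term form; the function and argument names are illustrative assumptions, not this project's actual code:

    def lagrangian_regularization(expected_sparsity, target_sparsity, lambda_1, lambda_2):
        # lambda_1 and lambda_2 are updated by gradient ascent on this term,
        # so the logged value can legitimately go negative while the
        # sparsity gap changes sign, as seen in the records above.
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap ** 2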
---------------------------------------------------------------------- time: 2023-07-19 15:01:48 Evaluating: matthews_correlation: 0.5805, eval_loss: 0.6077, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.2053, step: 6400 lambda_1: -1.5226, lambda_2: 31.5138 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 0.94 0.8 0.51 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000000001 loss: 0.010901, lagrangian_loss: -0.001067, attention_score_distillation_loss: 0.000024 loss: 0.025414, lagrangian_loss: -0.005275, attention_score_distillation_loss: 0.000025 ETA: 1:24:17 | Epoch 23 finished. Took 64.63 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:02:00 Evaluating: matthews_correlation: 0.5933, eval_loss: 0.6043, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.2069, step: 6450 lambda_1: -1.3082, lambda_2: 31.5624 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.97 0.93 0.79 0.5 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000000001 loss: 0.011725, lagrangian_loss: -0.005129, attention_score_distillation_loss: 0.000024 loss: 0.031535, lagrangian_loss: -0.003189, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:02:13 Evaluating: matthews_correlation: 0.5874, eval_loss: 0.6127, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2324, expected_sparsity: 0.211, expected_sequence_sparsity: 0.8584, target_sparsity: 0.2085, step: 6500 lambda_1: -0.5824, lambda_2: 31.7570 lambda_3: 0.0000 train remain: [1. 1. 1. 1.
0.99 0.97 0.92 0.78 0.49 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.68, 0.34, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000000001 loss: 0.054788, lagrangian_loss: -0.002071, attention_score_distillation_loss: 0.000023 loss: 0.013980, lagrangian_loss: 0.000066, attention_score_distillation_loss: 0.000023 ---------------------------------------------------------------------- time: 2023-07-19 15:02:25 Evaluating: matthews_correlation: 0.595, eval_loss: 0.6094, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2324, expected_sparsity: 0.211, expected_sequence_sparsity: 0.8584, target_sparsity: 0.2101, step: 6550 lambda_1: 0.2173, lambda_2: 31.9953 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.97 0.93 0.78 0.49 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.68, 0.34, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011111011100 11110110010110001000 00000010000000000000 loss: 0.056700, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000022 loss: 0.013869, lagrangian_loss: 0.000322, attention_score_distillation_loss: 0.000024 ---------------------------------------------------------------------- time: 2023-07-19 15:02:37 Evaluating: matthews_correlation: 0.5899, eval_loss: 0.6163, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.2117, step: 6600 lambda_1: 0.2802, lambda_2: 32.0544 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.98 0.94 0.79 0.5 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000001000 loss: 0.013233, lagrangian_loss: -0.000594, attention_score_distillation_loss: 0.000023 loss: 0.011961, lagrangian_loss: 0.003230, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:02:50 Evaluating: matthews_correlation: 0.593, eval_loss: 0.6061, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2096, expected_sparsity: 0.1997, expected_sequence_sparsity: 0.8563, target_sparsity: 0.2133, step: 6650 lambda_1: -0.5130, lambda_2: 32.2908 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.99 0.98 0.94 0.78 0.5 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.8, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.72, 0.36, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011111011100 11110110010110001000 00000000000000001000 loss: 0.020059, lagrangian_loss: 0.002119, attention_score_distillation_loss: 0.000023 loss: 0.051383, lagrangian_loss: 0.001317, attention_score_distillation_loss: 0.000023 ---------------------------------------------------------------------- time: 2023-07-19 15:03:02 Evaluating: matthews_correlation: 0.5871, eval_loss: 0.6062, token_prune_loc: [False, False, False, False, False, False, True, True, True, True], macs_sparsity: 0.2172, expected_sparsity: 0.2049, expected_sequence_sparsity: 0.8573, target_sparsity: 0.2149, step: 6700 lambda_1: -1.1349, lambda_2: 32.4474 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.97 0.93 0.77 0.49 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.75, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.68, 0.34, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111011011011100 11110110010110001000 00000000000000000001 ETA: 1:23:22 | Epoch 24 finished. Took 70.2 seconds. loss: 0.057993, lagrangian_loss: 0.003038, attention_score_distillation_loss: 0.000022 loss: 0.018193, lagrangian_loss: -0.000103, attention_score_distillation_loss: 0.000023 ---------------------------------------------------------------------- time: 2023-07-19 15:03:15 Evaluating: matthews_correlation: 0.605, eval_loss: 0.6004, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2324, expected_sparsity: 0.216, expected_sequence_sparsity: 0.8593, target_sparsity: 0.2165, step: 6750 lambda_1: -1.3062, lambda_2: 32.4893 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.99 0.97 0.92 0.76 0.48 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.32, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110010110001000 00000000000000000001 loss: 0.036383, lagrangian_loss: -0.002991, attention_score_distillation_loss: 0.000023 loss: 0.021847, lagrangian_loss: -0.000308, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:03:27 Evaluating: matthews_correlation: 0.5922, eval_loss: 0.6012, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2324, expected_sparsity: 0.216, expected_sequence_sparsity: 0.8593, target_sparsity: 0.2181, step: 6800 lambda_1: -1.1047, lambda_2: 32.5225 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.99 0.97 0.92 0.76 0.48 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.5, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.32, 0.02] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110010110001000 00000000000000000001 loss: 0.032245, lagrangian_loss: -0.001734, attention_score_distillation_loss: 0.000022 loss: 0.008419, lagrangian_loss: 0.001129, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:03:40 Evaluating: matthews_correlation: 0.5973, eval_loss: 0.6076, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2197, step: 6850 lambda_1: -1.0561, lambda_2: 32.5447 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.99 0.97 0.92 0.76 0.48 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000000000000000001 loss: 0.568919, lagrangian_loss: 0.000183, attention_score_distillation_loss: 0.000022 loss: 0.085156, lagrangian_loss: -0.000182, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:03:52 Evaluating: matthews_correlation: 0.6105, eval_loss: 0.5931, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2214, step: 6900 lambda_1: -1.1003, lambda_2: 32.5618 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.99 0.97 0.92 0.76 0.47 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000000000000000001 loss: 0.017220, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.000022 loss: 0.094591, lagrangian_loss: 0.000681, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:04:05 Evaluating: matthews_correlation: 0.5846, eval_loss: 0.6043, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.223, step: 6950 lambda_1: -1.0619, lambda_2: 32.5806 lambda_3: 0.0000 train remain: [1. 1. 1. 
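0.99 0.99 0.97 0.92 0.75 0.46 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000010000000000000 loss: 0.022078, lagrangian_loss: -0.001524, attention_score_distillation_loss: 0.000021 ETA: 1:22:10 | Epoch 25 finished. Took 64.81 seconds. loss: 0.060404, lagrangian_loss: -0.001184, attention_score_distillation_loss: 0.000021

Note: target_sparsity advances by roughly 0.0016 every 50 steps, consistent with a linear ramp of the run's 0.43 target spread across its 50 lagrangian warmup epochs. A sketch of that schedule, assuming about 268 optimizer steps per epoch (CoLA's 8,551 training sentences at batch size 32; both numbers are inferences from the log, not values it prints):

    steps_per_epoch = 268                  # assumed: ceil(8551 / 32)
    warmup_steps = 50 * steps_per_epoch    # lagrangian warmup of 50 epochs

    def target_sparsity(step, final=0.43):
        # Linear ramp toward the final target; e.g. step 7000 gives ~0.2246,
        # matching the value reported below at that step.
        return min(final, final * step / warmup_steps)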
---------------------------------------------------------------------- time: 2023-07-19 15:04:17 Evaluating: matthews_correlation: 0.6038, eval_loss: 0.5979, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2246, step: 7000 lambda_1: -0.8470, lambda_2: 32.6176 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.99 0.97 0.91 0.75 0.45 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00001000000000000000 loss: 0.019870, lagrangian_loss: -0.002206, attention_score_distillation_loss: 0.000022 loss: 0.017761, lagrangian_loss: -0.000402, attention_score_distillation_loss: 0.000021 ---------------------------------------------------------------------- time: 2023-07-19 15:04:30 Evaluating: matthews_correlation: 0.5922, eval_loss: 0.6116, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2262, step: 7050 lambda_1: -0.5860, lambda_2: 32.6606 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.99 0.97 0.91 0.75 0.44 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00001000000000000000 loss: 0.023008, lagrangian_loss: -0.001594, attention_score_distillation_loss: 0.000022 loss: 0.019497, lagrangian_loss: -0.001094, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:04:55 Evaluating: matthews_correlation: 0.5788, eval_loss: 0.6251, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2294, step: 7150 lambda_1: -0.6290, lambda_2: 32.7102 lambda_3: 0.0000 train remain: [1. 1. 1.
0.98 0.99 0.97 0.91 0.75 0.44 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000000000100000000 loss: 0.012587, lagrangian_loss: -0.000490, attention_score_distillation_loss: 0.000022 loss: 0.018500, lagrangian_loss: -0.000105, attention_score_distillation_loss: 0.000022 ---------------------------------------------------------------------- time: 2023-07-19 15:04:55 Evaluating: matthews_correlation: 0.5788, eval_loss: 0.6251, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2294, step: 7150 lambda_1: -0.6290, lambda_2: 32.7102 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.99 0.97 0.91 0.75 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000000000000000001 loss: 0.028707, lagrangian_loss: 0.001305, attention_score_distillation_loss: 0.000021 loss: 0.025362, lagrangian_loss: 0.002263, attention_score_distillation_loss: 0.000021 ---------------------------------------------------------------------- time: 2023-07-19 15:05:07 Evaluating: matthews_correlation: 0.6047, eval_loss: 0.6074, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.231, step: 7200 lambda_1: -1.2665, lambda_2: 32.8411 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.98 0.97 0.91 0.75 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000110001000 00000010000000000000 loss: 0.263103, lagrangian_loss: 0.006303, attention_score_distillation_loss: 0.000020 loss: 0.038085, lagrangian_loss: 0.004649, attention_score_distillation_loss: 0.000021 ETA: 1:20:59 | Epoch 26 finished. Took 64.96 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:05:19 Evaluating: matthews_correlation: 0.579, eval_loss: 0.6212, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2185, expected_sequence_sparsity: 0.8598, target_sparsity: 0.2326, step: 7250 lambda_1: -1.9110, lambda_2: 32.9732 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.98 0.97 0.91 0.73 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.75, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.64, 0.29, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011011011100 11110110000100001001 00000000000000000001 loss: 0.023476, lagrangian_loss: 0.000959, attention_score_distillation_loss: 0.000021 loss: 0.014060, lagrangian_loss: 0.002324, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:05:32 Evaluating: matthews_correlation: 0.5922, eval_loss: 0.6049, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.8607, target_sparsity: 0.2342, step: 7300 lambda_1: -2.1563, lambda_2: 33.0205 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.98 0.97 0.91 0.71 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.7, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.6, 0.27, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001011100 11110110000100001001 00000000000000000001 loss: 0.008043, lagrangian_loss: 0.009290, attention_score_distillation_loss: 0.000020 loss: 0.145377, lagrangian_loss: 0.000114, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:05:44 Evaluating: matthews_correlation: 0.5839, eval_loss: 0.6212, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.8607, target_sparsity: 0.2358, step: 7350 lambda_1: -2.1814, lambda_2: 33.0706 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.99 0.98 0.96 0.9 0.69 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.7, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.6, 0.27, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001011100 11110110000100001001 00000000000000000001 loss: 0.016524, lagrangian_loss: -0.003541, attention_score_distillation_loss: 0.000020 loss: 0.029403, lagrangian_loss: -0.007358, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:05:57 Evaluating: matthews_correlation: 0.5948, eval_loss: 0.6075, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2475, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8615, target_sparsity: 0.2374, step: 7400 lambda_1: -1.4666, lambda_2: 33.2553 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.99 0.98 0.96 0.89 0.66 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.65, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.56, 0.25, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001010100 11110110000100001100 00000000000000000001 loss: 0.055722, lagrangian_loss: -0.007677, attention_score_distillation_loss: 0.000020 loss: 0.130857, lagrangian_loss: -0.003440, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:06:09 Evaluating: matthews_correlation: 0.595, eval_loss: 0.6073, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2475, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8615, target_sparsity: 0.239, step: 7450 lambda_1: 0.1395, lambda_2: 34.0576 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.98 0.96 0.89 0.65 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.65, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.56, 0.25, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001010100 11110110000100011000 00001000000000000000 loss: 0.013592, lagrangian_loss: 0.001091, attention_score_distillation_loss: 0.000019 loss: 0.008476, lagrangian_loss: 0.002800, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:06:22 Evaluating: matthews_correlation: 0.5816, eval_loss: 0.6167, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2475, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8615, target_sparsity: 0.2406, step: 7500 lambda_1: 0.9611, lambda_2: 34.4016 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.98 0.96 0.91 0.67 0.44 0.06] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.65, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.56, 0.25, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001010100 11110110000100001001 00000001000000000000 loss: 0.016059, lagrangian_loss: -0.002496, attention_score_distillation_loss: 0.000020 ETA: 1:20:01 | Epoch 27 finished. Took 70.01 seconds. loss: 0.072798, lagrangian_loss: -0.002035, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 15:06:34 Evaluating: matthews_correlation: 0.5767, eval_loss: 0.6193, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2399, expected_sparsity: 0.2233, expected_sequence_sparsity: 0.8607, target_sparsity: 0.2422, step: 7550 lambda_1: -0.3093, lambda_2: 35.0734 lambda_3: 0.0000 train remain: [1. 1. 1. 
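0.99 0.99 0.97 0.91 0.68 0.44 0.07] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.7, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.6, 0.27, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001011100 11110110000100001001 00000000000000000001 loss: 0.030122, lagrangian_loss: 0.004169, attention_score_distillation_loss: 0.000019 loss: 0.164872, lagrangian_loss: 0.001114, attention_score_distillation_loss: 0.000020

Note: layerwise remain reads as the cumulative product of infer remain, left-padded with 1.0 for the layers ahead of the first prune location; in the step-7550 record above, 0.95 * 0.9 = 0.86 and 0.86 * 0.7 = 0.6 at the printed precision. A small check of that reading (an interpretation of the printout, not the training code):

    import numpy as np

    # infer remain at step 7550, as printed above
    infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.7, 0.45, 0.05]
    layerwise = [1.0, 1.0] + list(np.cumprod(infer_remain))
    # Compare with the layerwise remain printed above (two-decimal output):
    # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.6, 0.27, 0.01]
    print(layerwise)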
---------------------------------------------------------------------- time: 2023-07-19 15:06:47 Evaluating: matthews_correlation: 0.5892, eval_loss: 0.615, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2475, expected_sparsity: 0.2281, expected_sequence_sparsity: 0.8615, target_sparsity: 0.2438, step: 7600 lambda_1: -1.3208, lambda_2: 35.5525 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.98 0.97 0.9 0.65 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.65, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.56, 0.25, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111011001010100 11110110000100101000 00000001000000000000 loss: 0.017901, lagrangian_loss: 0.002076, attention_score_distillation_loss: 0.000019 loss: 0.141312, lagrangian_loss: -0.002981, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:06:59 Evaluating: matthews_correlation: 0.5976, eval_loss: 0.61, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2475, expected_sparsity: 0.235, expected_sequence_sparsity: 0.8628, target_sparsity: 0.2454, step: 7650 lambda_1: -1.2270, lambda_2: 35.6177 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.98 0.98 0.97 0.89 0.62 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.51, 0.21, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111010001010100 11110110000100001000 00000010000000000000 loss: 0.047547, lagrangian_loss: -0.002135, attention_score_distillation_loss: 0.000019 loss: 0.013926, lagrangian_loss: -0.000960, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:07:11 Evaluating: matthews_correlation: 0.5838, eval_loss: 0.6376, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2626, expected_sparsity: 0.2416, expected_sequence_sparsity: 0.864, target_sparsity: 0.247, step: 7700 lambda_1: -0.8455, lambda_2: 35.6987 lambda_3: 0.0000 train remain: [1. 1.
0.99 0.98 0.98 0.97 0.88 0.62 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.19, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000100000000 loss: 0.020657, lagrangian_loss: -0.000704, attention_score_distillation_loss: 0.000019 loss: 0.008797, lagrangian_loss: 0.001107, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:07:24 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6142, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2551, expected_sparsity: 0.2397, expected_sequence_sparsity: 0.8636, target_sparsity: 0.2486, step: 7750 lambda_1: -0.5756, lambda_2: 35.7705 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.98 0.97 0.87 0.62 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.22, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001100 00000000000000000001 loss: 0.007033, lagrangian_loss: 0.001595, attention_score_distillation_loss: 0.000018 ETA: 1:18:49 | Epoch 28 finished. Took 64.65 seconds. loss: 0.147322, lagrangian_loss: -0.000591, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:07:36 Evaluating: matthews_correlation: 0.5874, eval_loss: 0.6165, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2551, expected_sparsity: 0.2397, expected_sequence_sparsity: 0.8636, target_sparsity: 0.2502, step: 7800 lambda_1: -0.3727, lambda_2: 35.8280 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.98 0.97 0.87 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.22, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100101000 00000000000100000000 loss: 0.012161, lagrangian_loss: -0.000141, attention_score_distillation_loss: 0.000019 loss: 0.013375, lagrangian_loss: -0.000416, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:07:48 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6158, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2551, expected_sparsity: 0.2397, expected_sequence_sparsity: 0.8636, target_sparsity: 0.2518, step: 7850 lambda_1: -0.4445, lambda_2: 35.8689 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.98 0.98 0.97 0.87 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.22, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100101000 00001000000000000000 loss: 0.068100, lagrangian_loss: 0.000685, attention_score_distillation_loss: 0.000019 loss: 0.015647, lagrangian_loss: 0.003286, attention_score_distillation_loss: 0.000018 ---------------------------------------------------------------------- time: 2023-07-19 15:08:01 Evaluating: matthews_correlation: 0.5936, eval_loss: 0.6123, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2551, expected_sparsity: 0.2397, expected_sequence_sparsity: 0.8636, target_sparsity: 0.2534, step: 7900 lambda_1: -1.1473, lambda_2: 36.0596 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.98 0.97 0.86 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.45, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.22, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001001 00000000000000000001 loss: 0.010842, lagrangian_loss: 0.000337, attention_score_distillation_loss: 0.000019 loss: 0.048490, lagrangian_loss: 0.002049, attention_score_distillation_loss: 0.000019 ---------------------------------------------------------------------- time: 2023-07-19 15:08:13 Evaluating: matthews_correlation: 0.5741, eval_loss: 0.6317, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2626, expected_sparsity: 0.2416, expected_sequence_sparsity: 0.864, target_sparsity: 0.255, step: 7950 lambda_1: -1.9970, lambda_2: 36.3266 lambda_3: 0.0000 train remain: [1. 1. 1. 0.98 0.98 0.96 0.86 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.19, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.186622, lagrangian_loss: -0.002070, attention_score_distillation_loss: 0.000019 loss: 0.015360, lagrangian_loss: 0.006701, attention_score_distillation_loss: 0.000018 ---------------------------------------------------------------------- time: 2023-07-19 15:08:26 Evaluating: matthews_correlation: 0.5769, eval_loss: 0.6309, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.2626, expected_sparsity: 0.2416, expected_sequence_sparsity: 0.864, target_sparsity: 0.2567, step: 8000 lambda_1: -2.4785, lambda_2: 36.4528 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.98 0.98 0.96 0.86 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.81, 0.48, 0.19, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.091817, lagrangian_loss: 0.003654, attention_score_distillation_loss: 0.000018 loss: 0.009841, lagrangian_loss: 0.004848, attention_score_distillation_loss: 0.000017 ETA: 1:17:38 | Epoch 29 finished. Took 64.54 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:08:38 Evaluating: matthews_correlation: 0.5743, eval_loss: 0.6255, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2583, step: 8050 lambda_1: -2.3743, lambda_2: 36.5266 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.98 0.97 0.96 0.86 0.61 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.46, 0.18, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.014136, lagrangian_loss: 0.000058, attention_score_distillation_loss: 0.000018 loss: 0.055208, lagrangian_loss: -0.005707, attention_score_distillation_loss: 0.000017 ---------------------------------------------------------------------- time: 2023-07-19 15:08:50 Evaluating: matthews_correlation: 0.5732, eval_loss: 0.6448, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.293, expected_sparsity: 0.2707, expected_sequence_sparsity: 0.8693, target_sparsity: 0.2599, step: 8100 lambda_1: -1.9173, lambda_2: 36.6391 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.97 0.97 0.96 0.85 0.6 0.42 0.05] infer remain: [1.0, 1.0, 1.0, 0.95, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.86, 0.73, 0.44, 0.17, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.018682, lagrangian_loss: -0.001692, attention_score_distillation_loss: 0.000017 loss: 0.022946, lagrangian_loss: -0.004632, attention_score_distillation_loss: 0.000018 ---------------------------------------------------------------------- time: 2023-07-19 15:09:03 Evaluating: matthews_correlation: 0.5823, eval_loss: 0.6247, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.293, expected_sparsity: 0.2707, expected_sequence_sparsity: 0.8693, target_sparsity: 0.2615, step: 8150 lambda_1: -0.9746, lambda_2: 36.9650 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.97 0.97 0.96 0.85 0.59 0.42 0.05] infer remain: [1.0, 1.0, 1.0, 0.95, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.86, 0.73, 0.44, 0.17, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.006860, lagrangian_loss: -0.004193, attention_score_distillation_loss: 0.000018 loss: 0.012668, lagrangian_loss: -0.001015, attention_score_distillation_loss: 0.000018 ---------------------------------------------------------------------- time: 2023-07-19 15:09:15 Evaluating: matthews_correlation: 0.5927, eval_loss: 0.6161, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2631, step: 8200 lambda_1: 0.0458, lambda_2: 37.3486 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.97 0.97 0.96 0.85 0.59 0.42 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.46, 0.18, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000010000 loss: 0.076241, lagrangian_loss: 0.000847, attention_score_distillation_loss: 0.000018 loss: 0.014821, lagrangian_loss: -0.000304, attention_score_distillation_loss: 0.000017 ---------------------------------------------------------------------- time: 2023-07-19 15:09:28 Evaluating: matthews_correlation: 0.5876, eval_loss: 0.6175, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2647, step: 8250 lambda_1: 0.1485, lambda_2: 37.4541 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.98 0.97 0.96 0.85 0.59 0.43 0.05] infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.46, 0.18, 0.01] 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000000001 loss: 0.008767, lagrangian_loss: -0.000110, attention_score_distillation_loss: 0.000017 loss: 0.025605, lagrangian_loss: 0.000364, attention_score_distillation_loss: 0.000017 ---------------------------------------------------------------------- time: 2023-07-19 15:09:40 Evaluating: matthews_correlation: 0.5956, eval_loss: 0.6102, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2663, step: 8300 lambda_1: -0.4544, lambda_2: 37.6452 lambda_3: 0.0000 train remain: [1. 1. 
----------------------------------------------------------------------
time: 2023-07-19 15:09:40
Evaluating: matthews_correlation: 0.5956, eval_loss: 0.6102, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2663, step: 8300
lambda_1: -0.4544, lambda_2: 37.6452, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.97 0.97 0.96 0.85 0.59 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.46, 0.18, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000100000
loss: 0.098671, lagrangian_loss: 0.001773, attention_score_distillation_loss: 0.000017
ETA: 1:16:39 | Epoch 30 finished. Took 70.14 seconds.
loss: 0.018004, lagrangian_loss: 0.002134, attention_score_distillation_loss: 0.000018
----------------------------------------------------------------------
time: 2023-07-19 15:09:53
Evaluating: matthews_correlation: 0.5941, eval_loss: 0.6235, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2546, expected_sequence_sparsity: 0.8664, target_sparsity: 0.2679, step: 8350
lambda_1: -1.3001, lambda_2: 37.9272, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.97 0.97 0.96 0.86 0.58 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.6, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.46, 0.18, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 11111111010001010100 11110110000100001000 00000000000000100000
loss: 0.008726, lagrangian_loss: 0.009430, attention_score_distillation_loss: 0.000016
loss: 0.122726, lagrangian_loss: 0.003343, attention_score_distillation_loss: 0.000017
----------------------------------------------------------------------
time: 2023-07-19 15:10:05
Evaluating: matthews_correlation: 0.5851, eval_loss: 0.6255, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.293, expected_sparsity: 0.2747, expected_sequence_sparsity: 0.87, target_sparsity: 0.2695, step: 8400
lambda_1: -2.1710, lambda_2: 38.2526, lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.97 0.96 0.96 0.85 0.57 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.95, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.86, 0.73, 0.4, 0.16, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000000001
loss: 0.097895, lagrangian_loss: -0.001885, attention_score_distillation_loss: 0.000017
loss: 0.023614, lagrangian_loss: -0.005602, attention_score_distillation_loss: 0.000017
----------------------------------------------------------------------
time: 2023-07-19 15:10:18
Evaluating: matthews_correlation: 0.5953, eval_loss: 0.6009, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.293, expected_sparsity: 0.2747, expected_sequence_sparsity: 0.87, target_sparsity: 0.2711, step: 8450
lambda_1: -2.0348, lambda_2: 38.3541, lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.96 0.96 0.96 0.85 0.56 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.95, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.86, 0.73, 0.4, 0.16, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000001000
loss: 0.022478, lagrangian_loss: -0.005173, attention_score_distillation_loss: 0.000017
loss: 0.005744, lagrangian_loss: -0.006575, attention_score_distillation_loss: 0.000017
----------------------------------------------------------------------
time: 2023-07-19 15:10:30
Evaluating: matthews_correlation: 0.5975, eval_loss: 0.6014, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2905, expected_sequence_sparsity: 0.8729, target_sparsity: 0.2727, step: 8500
lambda_1: -1.1087, lambda_2: 38.7267, lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.96 0.96 0.96 0.84 0.55 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.69, 0.38, 0.15, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000000001
loss: 0.067958, lagrangian_loss: -0.002751, attention_score_distillation_loss: 0.000016
loss: 0.011846, lagrangian_loss: -0.000893, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:10:43
Evaluating: matthews_correlation: 0.5879, eval_loss: 0.6154, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.293, expected_sparsity: 0.2747, expected_sequence_sparsity: 0.87, target_sparsity: 0.2743, step: 8550
lambda_1: 0.0190, lambda_2: 39.2257, lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.96 0.96 0.96 0.85 0.55 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.95, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.86, 0.73, 0.4, 0.16, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000100000000
loss: 0.024509, lagrangian_loss: 0.001550, attention_score_distillation_loss: 0.000017
loss: 0.011548, lagrangian_loss: 0.000778, attention_score_distillation_loss: 0.000017
ETA: 1:15:29 | Epoch 31 finished. Took 65.02 seconds.
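The lagrangian_loss column and the two multipliers behave like the standard two-term Lagrangian used to steer a model toward a sparsity target in L0-regularization-style pruning (as in CoFi). The exact implementation is not shown in this log, so the sketch below is an assumption: the penalty is lambda_1*(s - t) + lambda_2*(s - t)^2, with the multipliers trained adversarially, which is why they drift while the constraint is unmet.

    def lagrangian_penalty(expected_sparsity: float,
                           target_sparsity: float,
                           lambda_1: float,
                           lambda_2: float) -> float:
        """Hypothetical reconstruction of lagrangian_loss (CoFi-style form)."""
        gap = expected_sparsity - target_sparsity
        return lambda_1 * gap + lambda_2 * gap * gap

    # With the step-8550 numbers above (s ~ 0.2747, t = 0.2743), the penalty
    # is tiny, matching the near-zero lagrangian_loss values in that stretch:
    print(lagrangian_penalty(0.2747, 0.2743, 0.0190, 39.2257))  # ~1.4e-05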
----------------------------------------------------------------------
time: 2023-07-19 15:10:55
Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6123, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.257, expected_sequence_sparsity: 0.8668, target_sparsity: 0.2759, step: 8600
lambda_1: 0.4135, lambda_2: 39.4988, lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.97 0.96 0.96 0.85 0.56 0.43 0.06]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.55, 0.45, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.42, 0.19, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100011000 00000000000000010000
loss: 0.071836, lagrangian_loss: -0.000998, attention_score_distillation_loss: 0.000016
loss: 0.005074, lagrangian_loss: 0.005812, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:11:08
Evaluating: matthews_correlation: 0.5959, eval_loss: 0.6094, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2587, expected_sequence_sparsity: 0.8671, target_sparsity: 0.2775, step: 8650
lambda_1: -1.0004, lambda_2: 40.3626, lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.97 0.96 0.96 0.85 0.56 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.42, 0.17, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000010000
loss: 0.045408, lagrangian_loss: 0.005748, attention_score_distillation_loss: 0.000016
loss: 0.010639, lagrangian_loss: 0.010003, attention_score_distillation_loss: 0.000016
----------------------------------------------------------------------
time: 2023-07-19 15:11:20
Evaluating: matthews_correlation: 0.5848, eval_loss: 0.6225, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2778, expected_sparsity: 0.2587, expected_sequence_sparsity: 0.8671, target_sparsity: 0.2791, step: 8700
lambda_1: -2.1741, lambda_2: 40.9719, lambda_3: 0.0000
train remain: [0.99 0.99 0.99 0.97 0.96 0.96 0.84 0.55 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.95, 0.9, 0.77, 0.42, 0.17, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000000001
loss: 0.021723, lagrangian_loss: 0.002141, attention_score_distillation_loss: 0.000016
loss: 0.017137, lagrangian_loss: 0.007935, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:11:33
Evaluating: matthews_correlation: 0.6016, eval_loss: 0.5995, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2905, expected_sequence_sparsity: 0.8729, target_sparsity: 0.2807, step: 8750
lambda_1: -2.6517, lambda_2: 41.1751, lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.96 0.96 0.95 0.83 0.53 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.85, 0.55, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.69, 0.38, 0.15, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110111111100 01111111010001010100 11110110000100001000 00000000000000000001
loss: 0.278577, lagrangian_loss: -0.002250, attention_score_distillation_loss: 0.000016
loss: 0.006810, lagrangian_loss: -0.004204, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:11:45
Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6106, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2995, expected_sequence_sparsity: 0.8746, target_sparsity: 0.2823, step: 8800
lambda_1: -2.0905, lambda_2: 41.4280, lambda_3: 0.0000
train remain: [1. 0.99 0.99 0.95 0.96 0.95 0.82 0.52 0.42 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.65, 0.32, 0.13, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110101111100 01111110010001010100 11110110000100001000 00000000000000000001
loss: 0.145027, lagrangian_loss: -0.011506, attention_score_distillation_loss: 0.000016
loss: 0.041731, lagrangian_loss: -0.006108, attention_score_distillation_loss: 0.000016
ETA: 1:14:19 | Epoch 32 finished. Took 64.73 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:11:57
Evaluating: matthews_correlation: 0.5825, eval_loss: 0.6211, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2995, expected_sequence_sparsity: 0.8746, target_sparsity: 0.2839, step: 8850
lambda_1: -1.0478, lambda_2: 41.9386, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.95 0.96 0.95 0.82 0.52 0.41 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.65, 0.32, 0.13, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110101111100 01111110010001010100 11110110000100001000 00000000000000000001
loss: 0.245305, lagrangian_loss: -0.004150, attention_score_distillation_loss: 0.000016
loss: 0.031111, lagrangian_loss: -0.001556, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:12:10
Evaluating: matthews_correlation: 0.5786, eval_loss: 0.6156, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2995, expected_sequence_sparsity: 0.8746, target_sparsity: 0.2855, step: 8900
lambda_1: -0.4718, lambda_2: 42.1688, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.95 0.82 0.52 0.41 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.65, 0.32, 0.13, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110101111100 01111110010001010100 11110110000100001000 00010000000000000000
loss: 0.032604, lagrangian_loss: -0.001311, attention_score_distillation_loss: 0.000016
loss: 0.028618, lagrangian_loss: 0.001438, attention_score_distillation_loss: 0.000014
----------------------------------------------------------------------
time: 2023-07-19 15:12:22
Evaluating: matthews_correlation: 0.576, eval_loss: 0.6195, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2995, expected_sequence_sparsity: 0.8746, target_sparsity: 0.2871, step: 8950
lambda_1: -0.2657, lambda_2: 42.2940, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.95 0.82 0.53 0.41 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.65, 0.32, 0.13, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110101111100 01111110010001010100 11110110000100001000 10000000000000000000
loss: 0.015320, lagrangian_loss: 0.001218, attention_score_distillation_loss: 0.000014
loss: 0.019936, lagrangian_loss: 0.000699, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:12:35
Evaluating: matthews_correlation: 0.5847, eval_loss: 0.6222, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3158, expected_sparsity: 0.2995, expected_sequence_sparsity: 0.8746, target_sparsity: 0.2887, step: 9000
lambda_1: -0.6296, lambda_2: 42.4269, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.95 0.96 0.95 0.82 0.52 0.4 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.95, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.81, 0.65, 0.32, 0.13, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111111111111110 11111111110101111100 01111110010001010100 11110110000100001000 00000000010000000000
loss: 0.031981, lagrangian_loss: 0.000689, attention_score_distillation_loss: 0.000015
loss: 0.033933, lagrangian_loss: 0.005801, attention_score_distillation_loss: 0.000014
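Each 20-character bit string is the inference keep/drop mask over the 20 token bins at one prunable layer, listed in layer order; the fraction of 1s reproduces the matching "infer remain" entry. A quick check against three of the step-9000 masks above:

    # Popcount over the 20-bin masks reproduces "infer remain" (step 9000).
    masks = [
        "11111111111111111111",  # 20/20 -> 1.0
        "11111111111111111100",  # 18/20 -> 0.9
        "11111111110101111100",  # 16/20 -> 0.8
    ]
    for m in masks:
        print(m.count("1") / len(m))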
----------------------------------------------------------------------
time: 2023-07-19 15:12:47
Evaluating: matthews_correlation: 0.5799, eval_loss: 0.6314, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3071, expected_sequence_sparsity: 0.876, target_sparsity: 0.2903, step: 9050
lambda_1: -1.2976, lambda_2: 42.6899, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.95 0.81 0.52 0.39 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.4, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.12, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100001000 00000000000000000001
loss: 0.010532, lagrangian_loss: 0.001552, attention_score_distillation_loss: 0.000015
loss: 0.041145, lagrangian_loss: -0.000397, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:13:00
Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6215, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.292, step: 9100
lambda_1: -1.5494, lambda_2: 42.8085, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.94 0.81 0.52 0.37 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.386931, lagrangian_loss: -0.006330, attention_score_distillation_loss: 0.000015
ETA: 1:13:19 | Epoch 33 finished. Took 70.2 seconds.
loss: 0.049552, lagrangian_loss: -0.001086, attention_score_distillation_loss: 0.000014
----------------------------------------------------------------------
time: 2023-07-19 15:13:12
Evaluating: matthews_correlation: 0.5805, eval_loss: 0.6369, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.2936, step: 9150
lambda_1: -1.4607, lambda_2: 42.9184, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.93 0.81 0.52 0.36 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 00010000000000000000
loss: 0.042516, lagrangian_loss: -0.002015, attention_score_distillation_loss: 0.000014
loss: 0.201674, lagrangian_loss: -0.005392, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:13:25
Evaluating: matthews_correlation: 0.5781, eval_loss: 0.6244, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.2952, step: 9200
lambda_1: -1.0547, lambda_2: 43.0725, lambda_3: 0.0000
train remain: [1. 1. 0.99 0.94 0.96 0.93 0.81 0.52 0.35 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.015654, lagrangian_loss: -0.003567, attention_score_distillation_loss: 0.000014
loss: 0.149190, lagrangian_loss: -0.002716, attention_score_distillation_loss: 0.000015
----------------------------------------------------------------------
time: 2023-07-19 15:13:37
Evaluating: matthews_correlation: 0.5807, eval_loss: 0.6203, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.2968, step: 9250
lambda_1: -0.5209, lambda_2: 43.3083, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.94 0.96 0.92 0.81 0.52 0.35 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 10000000000000000000
loss: 0.013079, lagrangian_loss: 0.001333, attention_score_distillation_loss: 0.000013
loss: 0.044101, lagrangian_loss: 0.000295, attention_score_distillation_loss: 0.000014
----------------------------------------------------------------------
time: 2023-07-19 15:13:50
Evaluating: matthews_correlation: 0.5764, eval_loss: 0.6212, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.2984, step: 9300
lambda_1: -0.2625, lambda_2: 43.4539, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.94 0.96 0.92 0.81 0.52 0.35 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 10000000000000000000
loss: 0.059778, lagrangian_loss: 0.001254, attention_score_distillation_loss: 0.000013
loss: 0.011849, lagrangian_loss: 0.000167, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:14:02
Evaluating: matthews_correlation: 0.5734, eval_loss: 0.6248, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.3, step: 9350
lambda_1: -0.3225, lambda_2: 43.6068, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.94 0.96 0.92 0.81 0.52 0.35 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 10000000000000000000
loss: 0.090054, lagrangian_loss: 0.001045, attention_score_distillation_loss: 0.000013
loss: 0.113706, lagrangian_loss: 0.000803, attention_score_distillation_loss: 0.000013
ETA: 1:12:09 | Epoch 34 finished. Took 64.94 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:14:14
Evaluating: matthews_correlation: 0.5815, eval_loss: 0.6214, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.3016, step: 9400
lambda_1: -0.8013, lambda_2: 43.8203, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.94 0.97 0.92 0.81 0.52 0.35 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 10000000000000000000
loss: 0.032666, lagrangian_loss: 0.004237, attention_score_distillation_loss: 0.000013
loss: 0.003834, lagrangian_loss: 0.002106, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:14:27
Evaluating: matthews_correlation: 0.5785, eval_loss: 0.623, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3233, expected_sparsity: 0.3084, expected_sequence_sparsity: 0.8762, target_sparsity: 0.3032, step: 9450
lambda_1: -1.5130, lambda_2: 44.2031, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.93 0.96 0.91 0.81 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.9, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.62, 0.31, 0.11, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111110 11111111110101111100 01111110010001010100 11110110000100000000 10000000000000000000
loss: 0.012296, lagrangian_loss: 0.002157, attention_score_distillation_loss: 0.000013
loss: 0.293377, lagrangian_loss: -0.003215, attention_score_distillation_loss: 0.000014
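target_sparsity advances linearly during the Lagrangian warmup: 0.3000 at step 9350, 0.3016 at 9400, 0.3032 at 9450, i.e. 0.0016 per 50 steps. A hypothetical helper that reproduces the local schedule (the true warmup start and end points are not shown in this excerpt):

    RATE = 0.0016 / 50  # 3.2e-5 sparsity per optimizer step, read off the log

    def target_at(step: int, ref_step: int = 9350, ref_target: float = 0.3) -> float:
        return ref_target + RATE * (step - ref_step)

    print(round(target_at(9400), 4))  # 0.3016
    print(round(target_at(9450), 4))  # 0.3032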
----------------------------------------------------------------------
time: 2023-07-19 15:14:40
Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6277, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3159, expected_sequence_sparsity: 0.8776, target_sparsity: 0.3048, step: 9500
lambda_1: -1.3895, lambda_2: 44.3132, lambda_3: 0.0000
train remain: [1. 1. 0.98 0.93 0.96 0.9 0.8 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.58, 0.29, 0.1, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101111100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.066329, lagrangian_loss: -0.002083, attention_score_distillation_loss: 0.000013
loss: 0.050489, lagrangian_loss: -0.003753, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:14:52
Evaluating: matthews_correlation: 0.5876, eval_loss: 0.615, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3159, expected_sequence_sparsity: 0.8776, target_sparsity: 0.3064, step: 9550
lambda_1: -0.7216, lambda_2: 44.5984, lambda_3: 0.0000
train remain: [1. 0.99 0.98 0.92 0.96 0.9 0.8 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.58, 0.29, 0.1, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101111100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.016453, lagrangian_loss: -0.002184, attention_score_distillation_loss: 0.000013
loss: 0.329211, lagrangian_loss: -0.000365, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:15:04
Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6125, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3159, expected_sequence_sparsity: 0.8776, target_sparsity: 0.308, step: 9600
lambda_1: -0.5007, lambda_2: 44.8040, lambda_3: 0.0000
train remain: [1. 0.99 0.98 0.92 0.97 0.9 0.79 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.58, 0.29, 0.1, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101111100 01111110010001010100 11110110000100000000 00000010000000000000
loss: 0.048029, lagrangian_loss: 0.000887, attention_score_distillation_loss: 0.000012
loss: 0.029233, lagrangian_loss: 0.002396, attention_score_distillation_loss: 0.000012
ETA: 1:10:59 | Epoch 35 finished. Took 64.77 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:15:17
Evaluating: matthews_correlation: 0.5965, eval_loss: 0.6156, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3159, expected_sequence_sparsity: 0.8776, target_sparsity: 0.3096, step: 9650
lambda_1: -0.7439, lambda_2: 44.9773, lambda_3: 0.0000
train remain: [1. 0.99 0.98 0.93 0.97 0.9 0.79 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.58, 0.29, 0.1, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101111100 01111110010001010100 11110110000100000000 00000010000000000000
loss: 0.064412, lagrangian_loss: 0.000952, attention_score_distillation_loss: 0.000013
loss: 0.032168, lagrangian_loss: 0.001575, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:15:29
Evaluating: matthews_correlation: 0.5811, eval_loss: 0.6202, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3159, expected_sequence_sparsity: 0.8776, target_sparsity: 0.3112, step: 9700
lambda_1: -1.6469, lambda_2: 45.5091, lambda_3: 0.0000
train remain: [1. 0.99 0.98 0.93 0.97 0.9 0.79 0.52 0.34 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.8, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.58, 0.29, 0.1, 0.01]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101111100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.045704, lagrangian_loss: 0.012517, attention_score_distillation_loss: 0.000012
loss: 0.023733, lagrangian_loss: 0.002945, attention_score_distillation_loss: 0.000013
----------------------------------------------------------------------
time: 2023-07-19 15:15:42
Evaluating: matthews_correlation: 0.5943, eval_loss: 0.6167, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3385, expected_sparsity: 0.3206, expected_sequence_sparsity: 0.8784, target_sparsity: 0.3128, step: 9750
lambda_1: -2.6658, lambda_2: 46.1051, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.93 0.97 0.89 0.78 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.73, 0.55, 0.27, 0.1, 0.0]
11111111111111111111 11111111111111111111 11111111111111111111 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.022846, lagrangian_loss: 0.006482, attention_score_distillation_loss: 0.000012
loss: 0.011755, lagrangian_loss: -0.002611, attention_score_distillation_loss: 0.000012
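token_prune_loc marks which of the ten prunable locations actually drop tokens at inference; it is True exactly where the discretized keep ratio falls below 1.0, so the True front creeps toward earlier layers as the target grows. Using the step-9750 vector above:

    # Reading of the log, checked against the step-9750 evaluation.
    infer_remain = [1.0, 1.0, 1.0, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
    token_prune_loc = [r < 1.0 for r in infer_remain]
    print(token_prune_loc)
    # [False, False, False, True, True, True, True, True, True, True] -- as logged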
----------------------------------------------------------------------
time: 2023-07-19 15:15:54
Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6224, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3144, step: 9800
lambda_1: -2.8000, lambda_2: 46.3340, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.93 0.97 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.015101, lagrangian_loss: -0.004354, attention_score_distillation_loss: 0.000012
loss: 0.081868, lagrangian_loss: -0.008533, attention_score_distillation_loss: 0.000012
----------------------------------------------------------------------
time: 2023-07-19 15:16:07
Evaluating: matthews_correlation: 0.5844, eval_loss: 0.6183, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.316, step: 9850
lambda_1: -2.2175, lambda_2: 46.5993, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.96 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11110110000100000000 00000000000000000001
loss: 0.019858, lagrangian_loss: -0.007236, attention_score_distillation_loss: 0.000012
loss: 0.037109, lagrangian_loss: -0.002074, attention_score_distillation_loss: 0.000011
----------------------------------------------------------------------
time: 2023-07-19 15:16:19
Evaluating: matthews_correlation: 0.5711, eval_loss: 0.6358, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3176, step: 9900
lambda_1: -1.0785, lambda_2: 47.3019, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.96 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000101000000 00000000000000000001
loss: 0.043951, lagrangian_loss: -0.003698, attention_score_distillation_loss: 0.000011
ETA: 1:09:59 | Epoch 36 finished. Took 70.19 seconds.
loss: 0.063070, lagrangian_loss: -0.000224, attention_score_distillation_loss: 0.000011
----------------------------------------------------------------------
time: 2023-07-19 15:16:31
Evaluating: matthews_correlation: 0.5846, eval_loss: 0.6253, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3192, step: 9950
lambda_1: 0.0275, lambda_2: 48.0384, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.97 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000100 10000000000000000000
loss: 0.006287, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000011
loss: 0.046565, lagrangian_loss: -0.000241, attention_score_distillation_loss: 0.000012
----------------------------------------------------------------------
time: 2023-07-19 15:16:44
Evaluating: matthews_correlation: 0.582, eval_loss: 0.6334, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3208, step: 10000
lambda_1: 0.1026, lambda_2: 48.2916, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.97 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000100 00000000010000000000
loss: 0.124736, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.000012
loss: 0.008960, lagrangian_loss: 0.007439, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:16:56
Evaluating: matthews_correlation: 0.5893, eval_loss: 0.6143, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3224, step: 10050
lambda_1: -1.1156, lambda_2: 49.1764, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.97 0.88 0.77 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100100000 00000000010000000000
loss: 0.032238, lagrangian_loss: 0.014232, attention_score_distillation_loss: 0.000011
loss: 0.014785, lagrangian_loss: 0.009492, attention_score_distillation_loss: 0.000011
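macs_sparsity and expected_sparsity are derived from the kept-token profile; the exact MACs accounting is not shown in this excerpt, but a first-order estimate, assuming cost proportional to the average of "layerwise remain", lands in the right neighborhood, with the residual plausibly coming from attention's quadratic dependence on the number of kept tokens:

    # Rough cross-check only -- not the exact formula used by the logger.
    layerwise = [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]  # step 10000
    print(round(1 - sum(layerwise) / len(layerwise), 4))
    # 0.3183, vs. logged expected_sparsity 0.3371 at that step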
----------------------------------------------------------------------
time: 2023-07-19 15:17:09
Evaluating: matthews_correlation: 0.58, eval_loss: 0.6311, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.324, step: 10100
lambda_1: -2.5759, lambda_2: 50.3570, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.92 0.97 0.87 0.76 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000001 10000000000000000000
loss: 0.010777, lagrangian_loss: 0.016172, attention_score_distillation_loss: 0.000011
loss: 0.019570, lagrangian_loss: 0.028398, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:17:21
Evaluating: matthews_correlation: 0.5741, eval_loss: 0.6307, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3256, step: 10150
lambda_1: -3.8025, lambda_2: 51.2585, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.91 0.97 0.87 0.76 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000001 00000000000000000001
loss: 0.011054, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000011
loss: 0.010084, lagrangian_loss: 0.015007, attention_score_distillation_loss: 0.000011
ETA: 1:08:49 | Epoch 37 finished. Took 64.89 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:17:34
Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6249, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3371, expected_sequence_sparsity: 0.8814, target_sparsity: 0.3272, step: 10200
lambda_1: -4.5690, lambda_2: 51.7695, lambda_3: 0.0000
train remain: [1. 0.99 0.97 0.91 0.97 0.87 0.76 0.52 0.33 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.35, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.09, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000001 00000000000000000001
loss: 0.063206, lagrangian_loss: 0.007598, attention_score_distillation_loss: 0.000011
loss: 0.018122, lagrangian_loss: -0.013536, attention_score_distillation_loss: 0.000011
----------------------------------------------------------------------
time: 2023-07-19 15:17:46
Evaluating: matthews_correlation: 0.5933, eval_loss: 0.621, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3289, step: 10250
lambda_1: -4.6189, lambda_2: 51.9711, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.96 0.87 0.76 0.52 0.29 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 11010110000100000000 00000000000000000001
loss: 0.471340, lagrangian_loss: -0.022614, attention_score_distillation_loss: 0.000011
loss: 0.012043, lagrangian_loss: 0.001318, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:17:59
Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6168, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3305, step: 10300
lambda_1: -3.9699, lambda_2: 52.3606, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.96 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000000000001
loss: 0.011994, lagrangian_loss: 0.000532, attention_score_distillation_loss: 0.000010
loss: 0.068330, lagrangian_loss: -0.011865, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:18:11
Evaluating: matthews_correlation: 0.5789, eval_loss: 0.6277, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3321, step: 10350
lambda_1: -3.0979, lambda_2: 52.9358, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.96 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000000000001
loss: 0.029443, lagrangian_loss: -0.020771, attention_score_distillation_loss: 0.000011
loss: 0.079617, lagrangian_loss: -0.006477, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:18:24
Evaluating: matthews_correlation: 0.5797, eval_loss: 0.6226, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3337, step: 10400
lambda_1: -2.1803, lambda_2: 53.5422, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.96 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 10010110000101000000 10000000000000000000
loss: 0.220605, lagrangian_loss: -0.004424, attention_score_distillation_loss: 0.000010
loss: 0.070341, lagrangian_loss: -0.000159, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:18:36
Evaluating: matthews_correlation: 0.5847, eval_loss: 0.6151, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3353, step: 10450
lambda_1: -1.7812, lambda_2: 53.7952, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.95 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 10000000000000000000
loss: 0.101031, lagrangian_loss: -0.003652, attention_score_distillation_loss: 0.000010
ETA: 1:07:49 | Epoch 38 finished. Took 70.33 seconds.
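The ETA printed at each epoch boundary is consistent with remaining-epochs times recent epoch time:

    # "ETA: 1:07:49" after epoch 38, with epochs taking ~65-70 s each:
    eta_seconds = 1 * 3600 + 7 * 60 + 49   # 4069 s
    print(eta_seconds / 70.33)             # ~57.9 -> roughly sixty epochs still to run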
loss: 0.125659, lagrangian_loss: 0.002010, attention_score_distillation_loss: 0.000009
----------------------------------------------------------------------
time: 2023-07-19 15:18:49
Evaluating: matthews_correlation: 0.5722, eval_loss: 0.6353, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3688, expected_sparsity: 0.3382, expected_sequence_sparsity: 0.8816, target_sparsity: 0.3369, step: 10500
lambda_1: -1.7080, lambda_2: 53.9878, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.95 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.95, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.81, 0.69, 0.52, 0.26, 0.08, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111110 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000000000001
loss: 0.038167, lagrangian_loss: -0.003590, attention_score_distillation_loss: 0.000010
loss: 0.019287, lagrangian_loss: -0.000287, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:19:01
Evaluating: matthews_correlation: 0.5797, eval_loss: 0.6121, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3385, step: 10550
lambda_1: -1.5714, lambda_2: 54.1733, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.94 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 10000000000000000000
loss: 0.027122, lagrangian_loss: 0.004806, attention_score_distillation_loss: 0.000009
loss: 0.307344, lagrangian_loss: -0.000576, attention_score_distillation_loss: 0.000010
----------------------------------------------------------------------
time: 2023-07-19 15:19:14
Evaluating: matthews_correlation: 0.5829, eval_loss: 0.597, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3401, step: 10600
lambda_1: -1.1777, lambda_2: 54.4723, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.93 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10011110000100000000 00000010000000000000
loss: 0.070496, lagrangian_loss: -0.003314, attention_score_distillation_loss: 0.000010
loss: 0.023249, lagrangian_loss: -0.000422, attention_score_distillation_loss: 0.000009
----------------------------------------------------------------------
time: 2023-07-19 15:19:26
Evaluating: matthews_correlation: 0.5873, eval_loss: 0.5982, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3417, step: 10650
lambda_1: -0.8891, lambda_2: 54.7604, lambda_3: 0.0000
train remain: [0.99 0.99 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000101000000 10000000000000000000
loss: 0.021091, lagrangian_loss: -0.001044, attention_score_distillation_loss: 0.000009
loss: 0.037529, lagrangian_loss: -0.001429, attention_score_distillation_loss: 0.000009
----------------------------------------------------------------------
time: 2023-07-19 15:19:39
Evaluating: matthews_correlation: 0.5918, eval_loss: 0.6008, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3433, step: 10700
lambda_1: -1.1288, lambda_2: 55.0707, lambda_3: 0.0000
train remain: [0.99 0.98 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000000000001
loss: 0.150048, lagrangian_loss: -0.001732, attention_score_distillation_loss: 0.000009
ETA: 1:06:39 | Epoch 39 finished. Took 64.98 seconds.
loss: 0.103664, lagrangian_loss: 0.003813, attention_score_distillation_loss: 0.000009
----------------------------------------------------------------------
time: 2023-07-19 15:19:51
Evaluating: matthews_correlation: 0.6026, eval_loss: 0.6023, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3449, step: 10750
lambda_1: -1.8887, lambda_2: 55.6486, lambda_3: 0.0000
train remain: [0.99 0.98 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000001000000
loss: 0.003596, lagrangian_loss: 0.004567, attention_score_distillation_loss: 0.000009
loss: 0.013089, lagrangian_loss: 0.011014, attention_score_distillation_loss: 0.000008
----------------------------------------------------------------------
time: 2023-07-19 15:20:04
Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6105, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3465, step: 10800
lambda_1: -2.3536, lambda_2: 55.9499, lambda_3: 0.0000
train remain: [0.99 0.98 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000001000000
loss: 0.183001, lagrangian_loss: 0.011979, attention_score_distillation_loss: 0.000008
loss: 0.114059, lagrangian_loss: 0.007792, attention_score_distillation_loss: 0.000009
----------------------------------------------------------------------
time: 2023-07-19 15:20:16
Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6124, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3481, step: 10850
lambda_1: -3.1669, lambda_2: 56.5291, lambda_3: 0.0000
train remain: [0.99 0.98 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00001000000000000000
loss: 0.145679, lagrangian_loss: 0.019641, attention_score_distillation_loss: 0.000008
loss: 0.083158, lagrangian_loss: 0.013298, attention_score_distillation_loss: 0.000008
----------------------------------------------------------------------
time: 2023-07-19 15:20:28
Evaluating: matthews_correlation: 0.5899, eval_loss: 0.6107, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3475, expected_sequence_sparsity: 0.8833, target_sparsity: 0.3497, step: 10900
lambda_1: -4.1563, lambda_2: 57.2571, lambda_3: 0.0000
train remain: [0.99 0.98 0.96 0.91 0.92 0.86 0.76 0.52 0.28 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.3, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.07, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000001 00000000000000000001
loss: 0.031650, lagrangian_loss: 0.013936, attention_score_distillation_loss: 0.000008
loss: 0.020002, lagrangian_loss: 0.009919, attention_score_distillation_loss: 0.000008
----------------------------------------------------------------------
time: 2023-07-19 15:20:41
Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6051, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3485, expected_sequence_sparsity: 0.8835, target_sparsity: 0.3513, step: 10950
lambda_1: -4.7788, lambda_2: 57.7119, lambda_3: 0.0000
train remain: [0.99 0.97 0.96 0.91 0.91 0.86 0.76 0.52 0.27 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.06, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000000000000000001
loss: 0.012259, lagrangian_loss: -0.004244, attention_score_distillation_loss: 0.000008
loss: 0.043142, lagrangian_loss: -0.019464, attention_score_distillation_loss: 0.000008
ETA: 1:05:30 | Epoch 40 finished. Took 64.74 seconds.
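The two remain vectors differ in kind: "train remain" is the soft expected keep ratio of the relaxed gates, while "infer remain" is snapped to the 20-bin grid used at inference, hence its 0.05 steps. A hypothetical quantizer reproduces most entries of the step-10950 pair above, though not all (train 0.97 maps to infer 1.0 there), so the hard decision is evidently made per bin rather than by rounding the average:

    # Hypothetical 20-bin quantizer -- a reading of the log, not the code.
    def to_bins(soft_ratio: float, bins: int = 20) -> float:
        return round(soft_ratio * bins) / bins

    print(to_bins(0.91))  # 0.9  (train 0.91 -> infer 0.9 at step 10950)
    print(to_bins(0.27))  # 0.25 (train 0.27 -> infer 0.25 at step 10950)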
----------------------------------------------------------------------
time: 2023-07-19 15:20:53
Evaluating: matthews_correlation: 0.5678, eval_loss: 0.6425, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3864, expected_sequence_sparsity: 0.8904, target_sparsity: 0.3529, step: 11000
lambda_1: -3.8142, lambda_2: 58.5307, lambda_3: 0.0000
train remain: [0.99 0.96 0.96 0.91 0.91 0.86 0.76 0.52 0.26 0.05]
infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.06, 0.0]
11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000010000000000000
loss: 0.004166, lagrangian_loss: -0.009546, attention_score_distillation_loss: 0.000008
loss: 0.050313, lagrangian_loss: -0.018738, attention_score_distillation_loss: 0.000008
----------------------------------------------------------------------
time: 2023-07-19 15:21:06
Evaluating: matthews_correlation: 0.5738, eval_loss: 0.6392, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3864, expected_sequence_sparsity: 0.8904, target_sparsity: 0.3545, step: 11050
lambda_1: -2.1761, lambda_2: 60.2459, lambda_3: 0.0000
train remain: [0.99 0.96 0.96 0.91 0.91 0.86 0.76 0.52 0.25 0.05]
infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.06, 0.0]
11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000010000000000000
loss: 0.069536, lagrangian_loss: -0.009903, attention_score_distillation_loss: 0.000008
loss: 0.210345, lagrangian_loss: -0.003467, attention_score_distillation_loss: 0.000008
----------------------------------------------------------------------
time: 2023-07-19 15:21:18
Evaluating: matthews_correlation: 0.5781, eval_loss: 0.642, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3864, expected_sequence_sparsity: 0.8904, target_sparsity: 0.3561, step: 11100
lambda_1: -0.9293, lambda_2: 61.3289, lambda_3: 0.0000
train remain: [0.99 0.96 0.95 0.91 0.91 0.86 0.76 0.52 0.24 0.05]
infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.06, 0.0]
11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000010000000000000
loss: 0.030342, lagrangian_loss: -0.002247, attention_score_distillation_loss: 0.000008
loss: 0.076960, lagrangian_loss: 0.001929, attention_score_distillation_loss: 0.000007
----------------------------------------------------------------------
time: 2023-07-19 15:21:31
Evaluating: matthews_correlation: 0.5726, eval_loss: 0.6442, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3864, expected_sequence_sparsity: 0.8904, target_sparsity: 0.3577, step: 11150
lambda_1: -0.3720, lambda_2: 61.7450, lambda_3: 0.0000
train remain: [0.99 0.96 0.95 0.91 0.91 0.86 0.76 0.52 0.24 0.05]
infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.06, 0.0]
11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000010000000000000
loss: 0.498850, lagrangian_loss: -0.000510, attention_score_distillation_loss: 0.000008
loss: 0.102209, lagrangian_loss: 0.000548, attention_score_distillation_loss: 0.000007
----------------------------------------------------------------------
time: 2023-07-19 15:21:43
Evaluating: matthews_correlation: 0.5741, eval_loss: 0.6412, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3864, expected_sequence_sparsity: 0.8904, target_sparsity: 0.3593, step: 11200
lambda_1: -0.3603, lambda_2: 61.9156, lambda_3: 0.0000
train remain: [0.99 0.96 0.95 0.91 0.91 0.86 0.76 0.52 0.24 0.05]
infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.06, 0.0]
11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000000000000100000
loss: 0.172002, lagrangian_loss: 0.001223, attention_score_distillation_loss: 0.000007
loss: 0.008031, lagrangian_loss: 0.007846, attention_score_distillation_loss: 0.000007
----------------------------------------------------------------------
time: 2023-07-19 15:21:56
Evaluating: matthews_correlation: 0.5782, eval_loss: 0.6062, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3485, expected_sequence_sparsity: 0.8835, target_sparsity: 0.3609, step: 11250
lambda_1: -1.2812, lambda_2: 62.6952, lambda_3: 0.0000
train remain: [1. 0.96 0.96 0.91 0.91 0.86 0.76 0.52 0.24 0.05]
infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.25, 0.05]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.06, 0.0]
11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010110000100000000 00000010000000000000
loss: 0.015344, lagrangian_loss: 0.018142, attention_score_distillation_loss: 0.000007
ETA: 1:04:28 | Epoch 41 finished. Took 70.18 seconds.
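Note that the pruning pattern is not monotone: the second prunable location sits right at the quantization boundary in this stretch (soft keep ~0.96), so its hard ratio flips between 0.9 and 1.0 from eval to eval, 0.9 through step 11200, 1.0 at 11250, and back to 0.9 by step 11350 below. Counting the flips:

    # Discretized keep ratio of the second prunable location across evals.
    layer_vals = [0.9, 0.9, 0.9, 1.0, 1.0, 0.9]  # steps 11100..11350
    print(sum(a != b for a, b in zip(layer_vals, layer_vals[1:])))  # 2 boundary flips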
loss: 0.008417, lagrangian_loss: 0.023353, attention_score_distillation_loss: 0.000007 ---------------------------------------------------------------------- time: 2023-07-19 15:22:08 Evaluating: matthews_correlation: 0.5897, eval_loss: 0.609, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.384, expected_sparsity: 0.3495, expected_sequence_sparsity: 0.8837, target_sparsity: 0.3625, step: 11300 lambda_1: -3.0112, lambda_2: 64.6220 lambda_3: 0.0000 train remain: [0.99 0.96 0.96 0.91 0.91 0.85 0.76 0.52 0.22 0.05] infer remain: [1.0, 1.0, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.49, 0.25, 0.05, 0.0] 11111111111111111111 11111111111111111111 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010010000100000000 10000000000000000000 loss: 0.024875, lagrangian_loss: 0.017281, attention_score_distillation_loss: 0.000007 loss: 0.068658, lagrangian_loss: 0.014095, attention_score_distillation_loss: 0.000007 ---------------------------------------------------------------------- time: 2023-07-19 15:22:21 Evaluating: matthews_correlation: 0.56, eval_loss: 0.6452, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.3873, expected_sequence_sparsity: 0.8906, target_sparsity: 0.3642, step: 11350 lambda_1: -3.9856, lambda_2: 65.4362 lambda_3: 0.0000 train remain: [0.99 0.96 0.95 0.91 0.91 0.83 0.76 0.52 0.2 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.85, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.59, 0.44, 0.22, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111111100 11111111110101110100 01111110010001010100 10010010000100000000 00000000000000000001 loss: 0.096230, lagrangian_loss: 0.002407, attention_score_distillation_loss: 0.000007 loss: 0.028507, lagrangian_loss: -0.005901, attention_score_distillation_loss: 0.000007 ---------------------------------------------------------------------- time: 2023-07-19 15:22:33 Evaluating: matthews_correlation: 0.5803, eval_loss: 0.6257, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3658, step: 11400 lambda_1: -3.9427, lambda_2: 65.6684 lambda_3: 0.0000 train remain: [0.99 0.95 0.96 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000100000000 10000000000000000000 loss: 0.014246, lagrangian_loss: -0.029202, attention_score_distillation_loss: 0.000007 loss: 0.016133, lagrangian_loss: -0.009803, attention_score_distillation_loss: 0.000007 ---------------------------------------------------------------------- time: 2023-07-19 15:22:45 Evaluating: matthews_correlation: 0.5705, eval_loss: 0.6498, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3674, step: 11450 lambda_1: -3.0358, lambda_2: 66.3676 lambda_3: 
0.0000 train remain: [0.99 0.94 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000100000000 00000000000000000001 loss: 0.045213, lagrangian_loss: -0.002574, attention_score_distillation_loss: 0.000006 loss: 0.030542, lagrangian_loss: -0.007863, attention_score_distillation_loss: 0.000006 ---------------------------------------------------------------------- time: 2023-07-19 15:22:58 Evaluating: matthews_correlation: 0.5792, eval_loss: 0.6416, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.369, step: 11500 lambda_1: -2.1497, lambda_2: 67.0260 lambda_3: 0.0000 train remain: [0.99 0.94 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000100000000 00000000000000000001 loss: 0.013651, lagrangian_loss: -0.004873, attention_score_distillation_loss: 0.000006 ETA: 1:03:19 | Epoch 42 finished. Took 64.86 seconds. loss: 0.037415, lagrangian_loss: 0.002574, attention_score_distillation_loss: 0.000006 ---------------------------------------------------------------------- time: 2023-07-19 15:23:10 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6474, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3706, step: 11550 lambda_1: -1.7268, lambda_2: 67.4514 lambda_3: 0.0000 train remain: [0.99 0.95 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010110000000000000 00000000000000000001 loss: 0.100746, lagrangian_loss: -0.001641, attention_score_distillation_loss: 0.000006 loss: 0.299178, lagrangian_loss: 0.004116, attention_score_distillation_loss: 0.000006 ---------------------------------------------------------------------- time: 2023-07-19 15:23:23 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6312, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3722, step: 11600 lambda_1: -1.8492, lambda_2: 67.7121 lambda_3: 0.0000 train remain: [0.99 0.95 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 
0.018999, lagrangian_loss: 0.003573, attention_score_distillation_loss: 0.000006 loss: 0.022343, lagrangian_loss: -0.004956, attention_score_distillation_loss: 0.000006 ---------------------------------------------------------------------- time: 2023-07-19 15:23:35 Evaluating: matthews_correlation: 0.5792, eval_loss: 0.6434, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3738, step: 11650 lambda_1: -1.9964, lambda_2: 67.9401 lambda_3: 0.0000 train remain: [0.99 0.94 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010010000000000 00000000000000000001 loss: 0.353658, lagrangian_loss: -0.002296, attention_score_distillation_loss: 0.000006 loss: 0.022852, lagrangian_loss: 0.003163, attention_score_distillation_loss: 0.000006 ---------------------------------------------------------------------- time: 2023-07-19 15:23:48 Evaluating: matthews_correlation: 0.5576, eval_loss: 0.6459, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3754, step: 11700 lambda_1: -2.0283, lambda_2: 68.1960 lambda_3: 0.0000 train remain: [0.99 0.94 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.019876, lagrangian_loss: 0.012601, attention_score_distillation_loss: 0.000005 loss: 0.035467, lagrangian_loss: 0.005602, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:24:00 Evaluating: matthews_correlation: 0.5766, eval_loss: 0.6289, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.377, step: 11750 lambda_1: -2.1977, lambda_2: 68.4447 lambda_3: 0.0000 train remain: [0.99 0.93 0.95 0.91 0.91 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.009989, lagrangian_loss: -0.000194, attention_score_distillation_loss: 0.000006 loss: 0.025707, lagrangian_loss: 0.004378, attention_score_distillation_loss: 0.000005 ETA: 1:02:10 | Epoch 43 finished. Took 64.88 seconds. 
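For readers decoding the 20-character bitstrings: bin_num=20, so each string appears to be the hard keep/drop mask over one pruning layer's 20 token-score bins, printed in layer order (layers 2-11), and a layer's logged "infer remain" is simply its fraction of ones. A small sketch under that reading; the mask strings are copied from the step-11700 record above, the helper itself is hypothetical.

# Decoding the logged bin masks (bin_num = 20); illustrative only.
prune_location = [2, 3, 4]          # first three pruning-enabled layers
masks = [
    "11111111111111111111",  # layer 2: 20/20 bins kept -> 1.00
    "11111111110111111110",  # layer 3: 18/20 -> 0.90
    "11111111111111111110",  # layer 4: 19/20 -> 0.95
]

for layer, mask in zip(prune_location, masks):
    keep = mask.count("1") / len(mask)
    print(f"layer {layer}: infer remain = {keep:.2f}")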
---------------------------------------------------------------------- time: 2023-07-19 15:24:13 Evaluating: matthews_correlation: 0.5796, eval_loss: 0.6369, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3786, step: 11800 lambda_1: -2.6291, lambda_2: 68.7659 lambda_3: 0.0000 train remain: [0.99 0.94 0.95 0.91 0.9 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000001000000 loss: 0.046319, lagrangian_loss: 0.010986, attention_score_distillation_loss: 0.000005 loss: 0.024056, lagrangian_loss: 0.004503, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:24:25 Evaluating: matthews_correlation: 0.5689, eval_loss: 0.6455, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3802, step: 11850 lambda_1: -3.3519, lambda_2: 69.3145 lambda_3: 0.0000 train remain: [0.99 0.93 0.95 0.91 0.9 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.153224, lagrangian_loss: 0.009713, attention_score_distillation_loss: 0.000005 loss: 0.053057, lagrangian_loss: 0.006700, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:24:38 Evaluating: matthews_correlation: 0.5676, eval_loss: 0.638, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4219, expected_sparsity: 0.393, expected_sequence_sparsity: 0.8917, target_sparsity: 0.3818, step: 11900 lambda_1: -3.4862, lambda_2: 69.5663 lambda_3: 0.0000 train remain: [0.99 0.93 0.94 0.91 0.9 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.95, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.86, 0.77, 0.69, 0.55, 0.42, 0.21, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111111111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.111515, lagrangian_loss: -0.011811, attention_score_distillation_loss: 0.000005 loss: 0.039089, lagrangian_loss: -0.017508, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:24:50 Evaluating: matthews_correlation: 0.56, eval_loss: 0.6407, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4295, expected_sparsity: 0.407, expected_sequence_sparsity: 0.8942, target_sparsity: 0.3834, step: 11950 lambda_1: -2.8213, lambda_2: 70.0731 lambda_3: 0.0000 train remain: [0.99 0.93 0.93 0.91 0.89 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 
0.9, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.66, 0.52, 0.39, 0.2, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000010000000000000 loss: 0.071014, lagrangian_loss: -0.016809, attention_score_distillation_loss: 0.000005 loss: 0.018529, lagrangian_loss: -0.000369, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:25:03 Evaluating: matthews_correlation: 0.566, eval_loss: 0.6381, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4295, expected_sparsity: 0.407, expected_sequence_sparsity: 0.8942, target_sparsity: 0.385, step: 12000 lambda_1: -1.4928, lambda_2: 71.3027 lambda_3: 0.0000 train remain: [0.99 0.93 0.93 0.91 0.89 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.66, 0.52, 0.39, 0.2, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010010100000000000 00000000010000000000 loss: 0.133920, lagrangian_loss: -0.005041, attention_score_distillation_loss: 0.000005 loss: 0.137888, lagrangian_loss: -0.003221, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:25:15 Evaluating: matthews_correlation: 0.5537, eval_loss: 0.6526, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4295, expected_sparsity: 0.407, expected_sequence_sparsity: 0.8942, target_sparsity: 0.3866, step: 12050 lambda_1: -0.6810, lambda_2: 71.9674 lambda_3: 0.0000 train remain: [0.99 0.93 0.93 0.91 0.89 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.66, 0.52, 0.39, 0.2, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111111100 11111111110111110100 11111111110101110100 01111110010001010100 10010110000000000000 10000000000000000000 loss: 0.089893, lagrangian_loss: -0.001242, attention_score_distillation_loss: 0.000005 ETA: 1:01:08 | Epoch 44 finished. Took 70.31 seconds. 
loss: 0.081127, lagrangian_loss: -0.001322, attention_score_distillation_loss: 0.000005 ---------------------------------------------------------------------- time: 2023-07-19 15:25:28 Evaluating: matthews_correlation: 0.57, eval_loss: 0.6481, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3882, step: 12100 lambda_1: -0.6447, lambda_2: 72.3218 lambda_3: 0.0000 train remain: [0.99 0.93 0.93 0.91 0.89 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010011000000000000 10000000000000000000 loss: 0.005346, lagrangian_loss: 0.004015, attention_score_distillation_loss: 0.000004 loss: 0.012497, lagrangian_loss: 0.007708, attention_score_distillation_loss: 0.000004 ---------------------------------------------------------------------- time: 2023-07-19 15:25:40 Evaluating: matthews_correlation: 0.5643, eval_loss: 0.6429, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3898, step: 12150 lambda_1: -1.1094, lambda_2: 72.7983 lambda_3: 0.0000 train remain: [0.99 0.93 0.93 0.91 0.88 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010011000000000000 10000000000000000000 loss: 0.012617, lagrangian_loss: 0.003528, attention_score_distillation_loss: 0.000004 loss: 0.242041, lagrangian_loss: -0.001657, attention_score_distillation_loss: 0.000004 ---------------------------------------------------------------------- time: 2023-07-19 15:25:53 Evaluating: matthews_correlation: 0.5552, eval_loss: 0.6541, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3914, step: 12200 lambda_1: -1.5886, lambda_2: 73.2272 lambda_3: 0.0000 train remain: [0.99 0.93 0.92 0.91 0.88 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000010000000000000 loss: 0.173751, lagrangian_loss: 0.006194, attention_score_distillation_loss: 0.000004 loss: 0.098648, lagrangian_loss: 0.003967, attention_score_distillation_loss: 0.000004 ---------------------------------------------------------------------- time: 2023-07-19 15:26:05 Evaluating: matthews_correlation: 0.5669, eval_loss: 0.6447, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.393, step: 12250 lambda_1: -2.1864, lambda_2: 73.6680 lambda_3: 0.0000 
train remain: [0.99 0.93 0.92 0.91 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.060556, lagrangian_loss: 0.003177, attention_score_distillation_loss: 0.000004 loss: 0.124855, lagrangian_loss: 0.005652, attention_score_distillation_loss: 0.000004 ---------------------------------------------------------------------- time: 2023-07-19 15:26:17 Evaluating: matthews_correlation: 0.5591, eval_loss: 0.6454, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3946, step: 12300 lambda_1: -2.4152, lambda_2: 73.9560 lambda_3: 0.0000 train remain: [0.99 0.93 0.92 0.91 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.251653, lagrangian_loss: -0.000820, attention_score_distillation_loss: 0.000004 loss: 0.036858, lagrangian_loss: 0.004874, attention_score_distillation_loss: 0.000004 ETA: 0:59:59 | Epoch 45 finished. Took 64.93 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:26:30 Evaluating: matthews_correlation: 0.5522, eval_loss: 0.6563, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3962, step: 12350 lambda_1: -2.3986, lambda_2: 74.1688 lambda_3: 0.0000 train remain: [0.99 0.92 0.92 0.91 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.018128, lagrangian_loss: -0.000187, attention_score_distillation_loss: 0.000003 loss: 0.189007, lagrangian_loss: -0.011121, attention_score_distillation_loss: 0.000004 ---------------------------------------------------------------------- time: 2023-07-19 15:26:42 Evaluating: matthews_correlation: 0.5622, eval_loss: 0.6628, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.3978, step: 12400 lambda_1: -1.8425, lambda_2: 74.7198 lambda_3: 0.0000 train remain: [0.99 0.92 0.92 0.9 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000100000 00000000000001000000 loss: 0.290776, 
lagrangian_loss: -0.005113, attention_score_distillation_loss: 0.000003 loss: 0.018239, lagrangian_loss: -0.004204, attention_score_distillation_loss: 0.000003 ---------------------------------------------------------------------- time: 2023-07-19 15:26:55 Evaluating: matthews_correlation: 0.5556, eval_loss: 0.6641, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.3995, step: 12450 lambda_1: -0.8843, lambda_2: 75.5357 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.89 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10110010000000000000 10000000000000000000 loss: 0.033300, lagrangian_loss: -0.002203, attention_score_distillation_loss: 0.000003 loss: 0.013774, lagrangian_loss: -0.000901, attention_score_distillation_loss: 0.000003 ---------------------------------------------------------------------- time: 2023-07-19 15:27:07 Evaluating: matthews_correlation: 0.5648, eval_loss: 0.6425, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4011, step: 12500 lambda_1: -0.2102, lambda_2: 76.0346 lambda_3: 0.0000 train remain: [0.99 0.92 0.92 0.88 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10110010000000000000 10000000000000000000 loss: 0.049194, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000003 loss: 0.069536, lagrangian_loss: 0.000198, attention_score_distillation_loss: 0.000003 ---------------------------------------------------------------------- time: 2023-07-19 15:27:20 Evaluating: matthews_correlation: 0.5574, eval_loss: 0.651, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.437, expected_sparsity: 0.4146, expected_sequence_sparsity: 0.8956, target_sparsity: 0.4027, step: 12550 lambda_1: -0.1553, lambda_2: 76.4018 lambda_3: 0.0000 train remain: [0.99 0.92 0.92 0.89 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.9, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.73, 0.62, 0.5, 0.37, 0.19, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111111111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000001000 10000000000000000000 loss: 0.245141, lagrangian_loss: -0.000074, attention_score_distillation_loss: 0.000003 loss: 0.007739, lagrangian_loss: 0.002959, attention_score_distillation_loss: 0.000003 ETA: 0:58:50 | Epoch 46 finished. Took 64.79 seconds. 
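On the sign flips in lagrangian_loss: L0-regularization setups of this kind (e.g. CoFi-style structured pruning) typically enforce the sparsity constraint with two learned multipliers, penalty = lambda_1*(s - t) + lambda_2*(s - t)^2, where s is the expected sparsity and t the current target; with lambda_1 negative, as logged here, the penalty can go negative depending on the sign of the gap. Whether this run uses exactly that form is an assumption; the sketch below only checks that it yields values of the right size, using the step-12200 record above.

import torch

# Multipliers copied from the step-12200 record.
lambda_1 = torch.tensor(-1.5886)
lambda_2 = torch.tensor(73.2272)

def lagrangian_penalty(expected_sparsity: float, target_sparsity: float):
    # Assumed CoFi-style penalty: linear + quadratic term in the gap.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

print(lagrangian_penalty(0.4146, 0.3914))  # tensor(0.0026)
# same order of magnitude as the lagrangian_loss values logged near that
# step (the training loop presumably uses the train-side sparsity, so an
# exact match is not expected)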
---------------------------------------------------------------------- time: 2023-07-19 15:27:32 Evaluating: matthews_correlation: 0.5587, eval_loss: 0.6429, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4043, step: 12600 lambda_1: -0.8723, lambda_2: 76.9613 lambda_3: 0.0000 train remain: [1. 0.92 0.92 0.88 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000100000 10000000000000000000 loss: 0.013572, lagrangian_loss: 0.004689, attention_score_distillation_loss: 0.000003 loss: 0.013754, lagrangian_loss: 0.009452, attention_score_distillation_loss: 0.000003 ---------------------------------------------------------------------- time: 2023-07-19 15:27:45 Evaluating: matthews_correlation: 0.563, eval_loss: 0.6398, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4059, step: 12650 lambda_1: -1.9368, lambda_2: 77.8749 lambda_3: 0.0000 train remain: [1. 0.92 0.92 0.88 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.009439, lagrangian_loss: 0.013242, attention_score_distillation_loss: 0.000002 loss: 0.038419, lagrangian_loss: -0.000194, attention_score_distillation_loss: 0.000002 ---------------------------------------------------------------------- time: 2023-07-19 15:27:57 Evaluating: matthews_correlation: 0.5582, eval_loss: 0.6654, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4075, step: 12700 lambda_1: -2.6398, lambda_2: 78.4924 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.87 0.87 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.015749, lagrangian_loss: 0.012035, attention_score_distillation_loss: 0.000002 loss: 0.044506, lagrangian_loss: 0.000566, attention_score_distillation_loss: 0.000002 ---------------------------------------------------------------------- time: 2023-07-19 15:28:09 Evaluating: matthews_correlation: 0.56, eval_loss: 0.6445, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4091, step: 12750 lambda_1: -2.9326, lambda_2: 78.8742 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.87 0.86 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 
0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 10000000000000000000 loss: 0.012405, lagrangian_loss: 0.002690, attention_score_distillation_loss: 0.000002 loss: 0.105821, lagrangian_loss: 0.011785, attention_score_distillation_loss: 0.000002 ---------------------------------------------------------------------- time: 2023-07-19 15:28:22 Evaluating: matthews_correlation: 0.5656, eval_loss: 0.6487, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4107, step: 12800 lambda_1: -3.5290, lambda_2: 79.4356 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.87 0.86 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000100000000 10000000000000000000 loss: 0.033446, lagrangian_loss: 0.012471, attention_score_distillation_loss: 0.000002 loss: 0.343924, lagrangian_loss: 0.000644, attention_score_distillation_loss: 0.000002 ---------------------------------------------------------------------- time: 2023-07-19 15:28:34 Evaluating: matthews_correlation: 0.5587, eval_loss: 0.6364, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4123, step: 12850 lambda_1: -4.3267, lambda_2: 80.1293 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.86 0.86 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000001000000 10000000000000000000 loss: 0.297354, lagrangian_loss: 0.002874, attention_score_distillation_loss: 0.000002 ETA: 0:57:47 | Epoch 47 finished. Took 69.84 seconds. 
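Each un-timestamped "loss: ..." line prints three components of one optimizer step. Together with the header arguments (do_distill=True, distill_loss_alpha=0.9, distill_ce_loss_alpha=1e-05), the natural reading is that the model minimizes a weighted distillation/task loss plus the lagrangian penalty plus the attention-score distillation term; the exact composition is not shown in the log, so the one-liner below is an assumption about how the logged pieces add up.

# Hypothetical composition of the logged per-step components; "loss" is
# taken to be the already-weighted distillation/task loss.
def total_loss(task_distill_loss, lagrangian_loss, attn_distill_loss):
    return task_distill_loss + lagrangian_loss + attn_distill_loss

# e.g. the step logged immediately below:
print(total_loss(0.017016, 0.023017, 0.000002))  # 0.040035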
loss: 0.017016, lagrangian_loss: 0.023017, attention_score_distillation_loss: 0.000002 ---------------------------------------------------------------------- time: 2023-07-19 15:28:46 Evaluating: matthews_correlation: 0.5582, eval_loss: 0.6525, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4139, step: 12900 lambda_1: -5.0539, lambda_2: 80.8944 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.86 0.86 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000010000000000000 loss: 0.011974, lagrangian_loss: 0.011646, attention_score_distillation_loss: 0.000002 loss: 0.058334, lagrangian_loss: 0.055785, attention_score_distillation_loss: 0.000001 ---------------------------------------------------------------------- time: 2023-07-19 15:28:59 Evaluating: matthews_correlation: 0.56, eval_loss: 0.6478, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4155, step: 12950 lambda_1: -5.6006, lambda_2: 81.5020 lambda_3: 0.0000 train remain: [0.99 0.92 0.91 0.86 0.86 0.82 0.76 0.52 0.19 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.159870, lagrangian_loss: -0.023206, attention_score_distillation_loss: 0.000002 loss: 0.119351, lagrangian_loss: 0.029666, attention_score_distillation_loss: 0.000001 ---------------------------------------------------------------------- time: 2023-07-19 15:29:11 Evaluating: matthews_correlation: 0.5604, eval_loss: 0.648, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4171, step: 13000 lambda_1: -5.5897, lambda_2: 81.8692 lambda_3: 0.0000 train remain: [0.99 0.91 0.91 0.86 0.86 0.82 0.76 0.52 0.18 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.109510, lagrangian_loss: -0.030583, attention_score_distillation_loss: 0.000001 loss: 0.046709, lagrangian_loss: 0.015762, attention_score_distillation_loss: 0.000001 ---------------------------------------------------------------------- time: 2023-07-19 15:29:24 Evaluating: matthews_correlation: 0.5518, eval_loss: 0.667, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4187, step: 13050 lambda_1: -5.6256, lambda_2: 82.2980 lambda_3: 
0.0000 train remain: [0.98 0.91 0.91 0.86 0.86 0.81 0.76 0.52 0.18 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000000000000001 loss: 0.080391, lagrangian_loss: -0.009136, attention_score_distillation_loss: 0.000001 loss: 0.062059, lagrangian_loss: 0.024923, attention_score_distillation_loss: 0.000001 ---------------------------------------------------------------------- time: 2023-07-19 15:29:36 Evaluating: matthews_correlation: 0.5509, eval_loss: 0.6659, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4248, expected_sequence_sparsity: 0.8974, target_sparsity: 0.4203, step: 13100 lambda_1: -5.6646, lambda_2: 82.6761 lambda_3: 0.0000 train remain: [0.98 0.91 0.91 0.86 0.86 0.81 0.76 0.51 0.18 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.2, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.04, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000001 00000000010000000000 loss: 0.311434, lagrangian_loss: -0.013854, attention_score_distillation_loss: 0.000001 loss: 0.086008, lagrangian_loss: 0.002281, attention_score_distillation_loss: 0.000001 ETA: 0:56:38 | Epoch 48 finished. Took 64.27 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:29:48 Evaluating: matthews_correlation: 0.5539, eval_loss: 0.6521, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.8976, target_sparsity: 0.4219, step: 13150 lambda_1: -5.7489, lambda_2: 82.9288 lambda_3: 0.0000 train remain: [0.98 0.91 0.91 0.86 0.86 0.81 0.76 0.51 0.17 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.03, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000000 00000000000000000001 loss: 0.096986, lagrangian_loss: -0.006381, attention_score_distillation_loss: 0.000001 loss: 0.021713, lagrangian_loss: -0.012491, attention_score_distillation_loss: 0.000001 ---------------------------------------------------------------------- time: 2023-07-19 15:30:01 Evaluating: matthews_correlation: 0.541, eval_loss: 0.6632, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.8976, target_sparsity: 0.4235, step: 13200 lambda_1: -5.2037, lambda_2: 83.4362 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.5 0.17 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.03, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000000 00000000000000000001 
loss: 0.099264, lagrangian_loss: -0.030334, attention_score_distillation_loss: 0.000001 loss: 0.167696, lagrangian_loss: -0.000448, attention_score_distillation_loss: 0.000001 Starting saving the best from epoch 49 and step 13250 ---------------------------------------------------------------------- time: 2023-07-19 15:30:13 Evaluating: matthews_correlation: 0.5488, eval_loss: 0.6552, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4444, expected_sequence_sparsity: 0.901, target_sparsity: 0.4251, step: 13250 lambda_1: -4.5620, lambda_2: 83.9948 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.49 0.16 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.17, 0.03, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000000 00000000000000000001 Saving the best model so far: [Epoch 49 | Step: 13250 | MACs sparsity: 0.4825 | Score: 0.5488 | Loss: 0.6552] loss: 0.192551, lagrangian_loss: -0.006780, attention_score_distillation_loss: 0.000001 loss: 0.092234, lagrangian_loss: 0.003691, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:31:20 Evaluating: matthews_correlation: 0.5613, eval_loss: 0.6445, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4444, expected_sequence_sparsity: 0.901, target_sparsity: 0.4267, step: 13300 lambda_1: -3.9501, lambda_2: 84.5221 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.48 0.16 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.17, 0.03, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010001010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5488 @ step 13250 epoch 49.44 Saving the best model so far: [Epoch 49 | Step: 13300 | MACs sparsity: 0.4825 | Score: 0.5613 | Loss: 0.6445] loss: 0.116650, lagrangian_loss: -0.015507, attention_score_distillation_loss: 0.000000 loss: 0.015322, lagrangian_loss: 0.012517, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:32:26 Evaluating: matthews_correlation: 0.5619, eval_loss: 0.6455, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.4283, step: 13350 lambda_1: -3.9957, lambda_2: 85.0085 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.16 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5613 @ step 13300 epoch 49.63 Saving the best model so far: [Epoch 49 | Step: 13350 | MACs sparsity: 0.4598 | Score: 0.5619 | Loss: 
0.6455] loss: 0.041762, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.000000 loss: 0.064095, lagrangian_loss: -0.009429, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:33:34 Evaluating: matthews_correlation: 0.5648, eval_loss: 0.6485, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.4299, step: 13400 lambda_1: -4.2261, lambda_2: 85.4723 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5619 @ step 13350 epoch 49.81 Saving the best model so far: [Epoch 50 | Step: 13400 | MACs sparsity: 0.4825 | Score: 0.5648 | Loss: 0.6485] ETA: 0:59:15 | Epoch 49 finished. Took 290.55 seconds. loss: 0.104926, lagrangian_loss: 0.031639, attention_score_distillation_loss: 0.000000 loss: 0.031427, lagrangian_loss: 0.005863, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:34:43 Evaluating: matthews_correlation: 0.5608, eval_loss: 0.6535, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.43, step: 13450 lambda_1: -3.9818, lambda_2: 86.0748 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5648 @ step 13400 epoch 50.00 loss: 0.114875, lagrangian_loss: -0.002128, attention_score_distillation_loss: 0.000000 loss: 0.021289, lagrangian_loss: -0.002031, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:34:56 Evaluating: matthews_correlation: 0.5608, eval_loss: 0.6609, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.43, step: 13500 lambda_1: -3.6040, lambda_2: 86.5499 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5648 @ step 13400 epoch 50.00 loss: 0.146220, lagrangian_loss: -0.011860, attention_score_distillation_loss: 0.000000 loss: 0.356937, lagrangian_loss: 0.002902, 
attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:35:08 Evaluating: matthews_correlation: 0.5682, eval_loss: 0.6403, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.43, step: 13550 lambda_1: -3.4718, lambda_2: 86.9005 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5648 @ step 13400 epoch 50.00 Saving the best model so far: [Epoch 50 | Step: 13550 | MACs sparsity: 0.4825 | Score: 0.5682 | Loss: 0.6403] loss: 0.113300, lagrangian_loss: -0.002627, attention_score_distillation_loss: 0.000000 loss: 0.025121, lagrangian_loss: -0.008093, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:37:47 Evaluating: matthews_correlation: 0.5699, eval_loss: 0.6312, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.43, step: 13600 lambda_1: -3.3258, lambda_2: 87.1839 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5682 @ step 13550 epoch 50.56 Saving the best model so far: [Epoch 50 | Step: 13600 | MACs sparsity: 0.4825 | Score: 0.5699 | Loss: 0.6312] loss: 0.198777, lagrangian_loss: 0.004025, attention_score_distillation_loss: 0.000000 loss: 0.084951, lagrangian_loss: -0.008308, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:38:37 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6503, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4458, expected_sequence_sparsity: 0.9013, target_sparsity: 0.43, step: 13650 lambda_1: -3.3421, lambda_2: 87.6556 lambda_3: 0.0000 train remain: [0.97 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [0.95, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.77, 0.65, 0.56, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5699 @ step 13600 epoch 50.75 Saving the best model so far: [Epoch 50 | Step: 13650 | MACs sparsity: 0.4825 | Score: 0.5837 | Loss: 0.6503] loss: 0.239102, lagrangian_loss: -0.006552, attention_score_distillation_loss: 0.000000 ETA: 1:01:34 | Epoch 50 
finished. Took 289.57 seconds. loss: 0.093645, lagrangian_loss: 0.014254, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:39:30 Evaluating: matthews_correlation: 0.5789, eval_loss: 0.639, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13700 lambda_1: -3.5008, lambda_2: 88.0629 lambda_3: 0.0000 train remain: [0.98 0.91 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.029678, lagrangian_loss: -0.006830, attention_score_distillation_loss: 0.000000 loss: 0.324671, lagrangian_loss: 0.017642, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:39:43 Evaluating: matthews_correlation: 0.5663, eval_loss: 0.642, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13750 lambda_1: -3.7021, lambda_2: 88.5442 lambda_3: 0.0000 train remain: [0.98 0.9 0.9 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.172385, lagrangian_loss: -0.004517, attention_score_distillation_loss: 0.000000 loss: 0.155050, lagrangian_loss: 0.013487, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:39:55 Evaluating: matthews_correlation: 0.5722, eval_loss: 0.6288, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13800 lambda_1: -4.0039, lambda_2: 88.9168 lambda_3: 0.0000 train remain: [0.98 0.9 0.9 0.86 0.86 0.81 0.76 0.46 0.15 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.015722, lagrangian_loss: -0.007664, attention_score_distillation_loss: 0.000000 loss: 0.093557, lagrangian_loss: -0.003425, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:40:08 Evaluating: matthews_correlation: 0.5648, eval_loss: 0.6532, token_prune_loc: [False, True, True, True, True, 
True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13850 lambda_1: -4.0707, lambda_2: 89.2443 lambda_3: 0.0000 train remain: [0.98 0.9 0.9 0.86 0.86 0.81 0.76 0.46 0.15 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.044132, lagrangian_loss: -0.000121, attention_score_distillation_loss: 0.000000 loss: 0.053605, lagrangian_loss: 0.011310, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:40:20 Evaluating: matthews_correlation: 0.5533, eval_loss: 0.6665, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13900 lambda_1: -3.9513, lambda_2: 89.6083 lambda_3: 0.0000 train remain: [0.98 0.91 0.9 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.102304, lagrangian_loss: -0.013883, attention_score_distillation_loss: 0.000000 loss: 0.143376, lagrangian_loss: 0.003918, attention_score_distillation_loss: 0.000000 ETA: 1:00:09 | Epoch 51 finished. Took 64.7 seconds. 
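From "Starting saving the best from epoch 49 and step 13250" onward the run is in best-so-far checkpointing mode: every evaluation that beats the previous best score triggers a save (0.5488 -> 0.5613 -> 0.5619 -> 0.5648 -> 0.5682 -> 0.5699 -> 0.5837 above), and later records keep restating "Best eval score so far: 0.5837 @ step 13650". A minimal sketch of that bookkeeping, with hypothetical names:

# Best-checkpoint tracking as suggested by the log; names are invented.
best = {"score": float("-inf"), "step": None}

def maybe_save_best(step, score, save_fn, save_start_step=13_250):
    # Saving only begins once lagrangian warmup has ended (step 13250 in
    # this run); only score improvements are written to disk.
    if step >= save_start_step and score > best["score"]:
        best["score"], best["step"] = score, step
        save_fn()

maybe_save_best(13650, 0.5837, lambda: print("Saving the best model so far"))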
---------------------------------------------------------------------- time: 2023-07-19 15:40:33 Evaluating: matthews_correlation: 0.5644, eval_loss: 0.654, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 13950 lambda_1: -3.8580, lambda_2: 89.9613 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000001000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.339845, lagrangian_loss: -0.013122, attention_score_distillation_loss: 0.000000 loss: 0.108715, lagrangian_loss: -0.015764, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:40:45 Evaluating: matthews_correlation: 0.5708, eval_loss: 0.6599, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14000 lambda_1: -3.2578, lambda_2: 90.6802 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.037809, lagrangian_loss: -0.004437, attention_score_distillation_loss: 0.000000 loss: 0.420646, lagrangian_loss: -0.005212, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:40:57 Evaluating: matthews_correlation: 0.5515, eval_loss: 0.6733, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14050 lambda_1: -2.1937, lambda_2: 91.7546 lambda_3: 0.0000 train remain: [0.97 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.238563, lagrangian_loss: -0.010756, attention_score_distillation_loss: 0.000000 loss: 0.065487, lagrangian_loss: -0.003227, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:41:10 Evaluating: matthews_correlation: 0.5637, eval_loss: 0.6552, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 
0.8979, target_sparsity: 0.43, step: 14100 lambda_1: -0.8808, lambda_2: 93.1917 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.156565, lagrangian_loss: -0.001528, attention_score_distillation_loss: 0.000000 loss: 0.104967, lagrangian_loss: 0.001786, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:41:22 Evaluating: matthews_correlation: 0.5611, eval_loss: 0.6471, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14150 lambda_1: 0.0765, lambda_2: 94.1417 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.188652, lagrangian_loss: 0.000231, attention_score_distillation_loss: 0.000000 loss: 0.018130, lagrangian_loss: -0.000198, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:41:35 Evaluating: matthews_correlation: 0.5657, eval_loss: 0.6598, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14200 lambda_1: 0.9440, lambda_2: 94.9602 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.47 0.15 0.06] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000010000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.146304, lagrangian_loss: -0.000619, attention_score_distillation_loss: 0.000000 ETA: 0:58:49 | Epoch 52 finished. Took 70.14 seconds. 
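The layerwise remain vector in each block can be reconstructed from infer remain: reading it as a running product of per-layer keep ratios, with the two never-pruned leading layers fixed at 1.0, reproduces the logged numbers exactly. A minimal check under that reading:

```python
# Reconstructs "layerwise remain" from "infer remain" as logged above.
# Assumes layers 0-1 are not prune locations, consistent with the two
# leading 1.0 entries and the ten pruned layers that follow.
infer_remain = [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05]
pruned_layers = list(range(2, 12))  # ten prune locations in a 12-layer model

layerwise, running = [], 1.0
for layer in range(12):
    if layer in pruned_layers:
        running *= infer_remain[pruned_layers.index(layer)]
    layerwise.append(round(running, 2))

print(layerwise)
# [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0]
```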
loss: 0.060414, lagrangian_loss: 0.002374, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:41:47 Evaluating: matthews_correlation: 0.5604, eval_loss: 0.6497, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4255, expected_sequence_sparsity: 0.8976, target_sparsity: 0.43, step: 14250 lambda_1: 0.7794, lambda_2: 95.4590 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.87 0.81 0.77 0.48 0.16 0.07] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.5, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.18, 0.03, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010101 10010000000000000001 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.014117, lagrangian_loss: -0.000436, attention_score_distillation_loss: 0.000000 loss: 0.063480, lagrangian_loss: -0.000108, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:42:00 Evaluating: matthews_correlation: 0.5663, eval_loss: 0.6567, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14300 lambda_1: -0.4597, lambda_2: 96.7514 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.47 0.16 0.06] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.044034, lagrangian_loss: 0.001476, attention_score_distillation_loss: 0.000000 loss: 0.053352, lagrangian_loss: -0.000591, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:42:12 Evaluating: matthews_correlation: 0.5637, eval_loss: 0.6562, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14350 lambda_1: -0.9792, lambda_2: 97.2164 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010001000000000000 00000010000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.557882, lagrangian_loss: -0.000832, attention_score_distillation_loss: 0.000000 loss: 0.179242, lagrangian_loss: 0.007573, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:42:25 Evaluating: matthews_correlation: 0.5744, eval_loss: 0.6345, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], 
macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14400 lambda_1: -1.1252, lambda_2: 97.5533 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.47 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000010000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.082231, lagrangian_loss: 0.007647, attention_score_distillation_loss: 0.000000 loss: 0.058085, lagrangian_loss: 0.015161, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:42:37 Evaluating: matthews_correlation: 0.5737, eval_loss: 0.645, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4598, expected_sparsity: 0.4271, expected_sequence_sparsity: 0.8979, target_sparsity: 0.43, step: 14450 lambda_1: -1.6457, lambda_2: 98.0022 lambda_3: 0.0000 train remain: [0.98 0.91 0.89 0.86 0.86 0.81 0.76 0.47 0.14 0.05] infer remain: [1.0, 0.9, 0.9, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.81, 0.69, 0.59, 0.47, 0.35, 0.16, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111110 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 00000010000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.112407, lagrangian_loss: -0.002259, attention_score_distillation_loss: 0.000000 ETA: 0:57:25 | Epoch 53 finished. Took 64.79 seconds. 
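The ten 20-character bitstrings printed with each evaluation are per-layer keep masks over this run's 20 token bins (1 = bin kept, 0 = bin dropped); the fraction of ones in each string is the matching infer remain entry. For example, with strings taken from the blocks above:

```python
# Each bitstring is one pruned layer's bin mask (1 = keep, 0 = drop).
masks = [
    "11111111111111111111",  # 20/20 bins kept -> 1.0
    "11111111110111111110",  # 18/20 -> 0.9
    "01111110010000010100",  #  9/20 -> 0.45
    "00000010000000000000",  #  1/20 -> 0.05
]
print([m.count("1") / len(m) for m in masks])  # [1.0, 0.9, 0.45, 0.05]
```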
loss: 0.099393, lagrangian_loss: -0.006033, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:42:49 Evaluating: matthews_correlation: 0.5756, eval_loss: 0.6083, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14500 lambda_1: -1.8314, lambda_2: 98.3741 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.054172, lagrangian_loss: -0.002762, attention_score_distillation_loss: 0.000000 loss: 0.072483, lagrangian_loss: 0.004212, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:43:02 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6261, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14550 lambda_1: -1.8900, lambda_2: 98.7103 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000000001 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.020021, lagrangian_loss: -0.008516, attention_score_distillation_loss: 0.000000 loss: 0.052388, lagrangian_loss: 0.003628, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:43:14 Evaluating: matthews_correlation: 0.5725, eval_loss: 0.6157, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14600 lambda_1: -1.6252, lambda_2: 99.1545 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010001000000000000 10000000000000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.024700, lagrangian_loss: -0.002474, attention_score_distillation_loss: 0.000000 loss: 0.193251, lagrangian_loss: -0.001244, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:43:27 Evaluating: matthews_correlation: 0.5785, eval_loss: 0.6195, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], 
macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14650 lambda_1: -1.1291, lambda_2: 99.5728 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.287620, lagrangian_loss: -0.002851, attention_score_distillation_loss: 0.000000 loss: 0.230380, lagrangian_loss: -0.002016, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:43:39 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6304, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14700 lambda_1: -0.6208, lambda_2: 100.0804 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000010000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.105329, lagrangian_loss: -0.000143, attention_score_distillation_loss: 0.000000 loss: 0.294497, lagrangian_loss: 0.002240, attention_score_distillation_loss: 0.000000 ETA: 0:56:02 | Epoch 54 finished. Took 64.95 seconds. 
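The fractional train remain values (0.98, 0.91, ...) against the grid-aligned infer remain values suggest stochastic relaxed gates during training and deterministic thresholded masks at evaluation. Below is a sketch of the usual hard-concrete gate (Louizos et al., 2018) with this run's temperature of 2/3; the stretch limits -0.1/1.1 are the customary defaults and an assumption here, as is the exact thresholding rule.

```python
import math
import torch

TEMP, LIMIT_L, LIMIT_R = 2 / 3, -0.1, 1.1  # limits assumed, not from this log

def sample_gate(log_alpha: torch.Tensor) -> torch.Tensor:
    # Training-time gate: reparameterized hard-concrete sample in [0, 1].
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / TEMP)
    return (s * (LIMIT_R - LIMIT_L) + LIMIT_L).clamp(0.0, 1.0)

def expected_remain(log_alpha: torch.Tensor) -> torch.Tensor:
    # P(gate > 0); averaged over a layer's bins this plausibly yields the
    # soft "train remain" numbers.
    return torch.sigmoid(log_alpha - TEMP * math.log(-LIMIT_L / LIMIT_R))

def inference_mask(log_alpha: torch.Tensor) -> torch.Tensor:
    # Evaluation-time gate: drop the noise, keep bins whose clamped gate
    # is still positive -- the 0/1 bin masks printed in the log.
    s = torch.sigmoid(log_alpha / TEMP)
    z = (s * (LIMIT_R - LIMIT_L) + LIMIT_L).clamp(0.0, 1.0)
    return (z > 0).float()
```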
---------------------------------------------------------------------- time: 2023-07-19 15:43:52 Evaluating: matthews_correlation: 0.5789, eval_loss: 0.6176, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14750 lambda_1: -0.4896, lambda_2: 100.5285 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.068494, lagrangian_loss: 0.002003, attention_score_distillation_loss: 0.000000 loss: 0.060202, lagrangian_loss: 0.004682, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:44:04 Evaluating: matthews_correlation: 0.5799, eval_loss: 0.6242, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14800 lambda_1: -0.8774, lambda_2: 100.9876 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.47 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000010000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.085510, lagrangian_loss: 0.005127, attention_score_distillation_loss: 0.000000 loss: 0.597183, lagrangian_loss: -0.003157, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:44:17 Evaluating: matthews_correlation: 0.5686, eval_loss: 0.6128, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14850 lambda_1: -1.2959, lambda_2: 101.5767 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.289920, lagrangian_loss: 0.001225, attention_score_distillation_loss: 0.000000 loss: 0.082837, lagrangian_loss: 0.013540, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:44:29 Evaluating: matthews_correlation: 0.5774, eval_loss: 0.6192, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, 
target_sparsity: 0.43, step: 14900 lambda_1: -1.7087, lambda_2: 102.0993 lambda_3: 0.0000 train remain: [0.99 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.042569, lagrangian_loss: 0.002196, attention_score_distillation_loss: 0.000000 loss: 0.179071, lagrangian_loss: -0.001205, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:44:42 Evaluating: matthews_correlation: 0.5756, eval_loss: 0.6175, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 14950 lambda_1: -2.2483, lambda_2: 102.6980 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.217522, lagrangian_loss: 0.003449, attention_score_distillation_loss: 0.000000 loss: 0.034789, lagrangian_loss: 0.015273, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:44:54 Evaluating: matthews_correlation: 0.5711, eval_loss: 0.6291, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15000 lambda_1: -2.3664, lambda_2: 103.0395 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.028713, lagrangian_loss: -0.001621, attention_score_distillation_loss: 0.000000 ETA: 0:54:44 | Epoch 55 finished. Took 70.33 seconds. 
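For reference, the matthews_correlation being tracked is CoLA's standard metric: a correlation coefficient over binary acceptability predictions, ranging from -1 to 1 with 0 meaning chance level, which is why scores near 0.58 are meaningful here. It can be computed directly:

```python
from sklearn.metrics import matthews_corrcoef

# Toy example only: CoLA labels are 1 (acceptable) / 0 (unacceptable).
labels      = [1, 1, 0, 1, 0, 0, 1, 1]
predictions = [1, 1, 0, 0, 0, 1, 1, 1]
print(matthews_corrcoef(labels, predictions))  # ~0.467 for this toy split
```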
loss: 0.043683, lagrangian_loss: -0.007860, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:45:07 Evaluating: matthews_correlation: 0.5757, eval_loss: 0.6327, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15050 lambda_1: -2.0549, lambda_2: 103.4810 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 loss: 0.377528, lagrangian_loss: -0.006342, attention_score_distillation_loss: 0.000000 loss: 0.155213, lagrangian_loss: 0.006327, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:45:19 Evaluating: matthews_correlation: 0.5844, eval_loss: 0.6177, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15100 lambda_1: -1.7531, lambda_2: 103.9031 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000100000000000 00000000010000000000 Best eval score so far: 0.5837 @ step 13650 epoch 50.93 Saving the best model so far: [Epoch 56 | Step: 15100 | MACs sparsity: 0.4674 | Score: 0.5844 | Loss: 0.6177] loss: 0.243499, lagrangian_loss: -0.004343, attention_score_distillation_loss: 0.000000 loss: 0.029357, lagrangian_loss: 0.007473, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:46:13 Evaluating: matthews_correlation: 0.5782, eval_loss: 0.6299, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15150 lambda_1: -1.9201, lambda_2: 104.3274 lambda_3: 0.0000 train remain: [0.98 0.92 0.88 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.123490, lagrangian_loss: -0.000349, attention_score_distillation_loss: 0.000000 loss: 0.026743, lagrangian_loss: 0.006117, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:46:26 Evaluating: 
matthews_correlation: 0.5763, eval_loss: 0.6329, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15200 lambda_1: -2.1067, lambda_2: 104.8077 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.014194, lagrangian_loss: -0.003988, attention_score_distillation_loss: 0.000000 loss: 0.105913, lagrangian_loss: -0.004059, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:46:38 Evaluating: matthews_correlation: 0.5708, eval_loss: 0.6123, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15250 lambda_1: -2.0005, lambda_2: 105.1508 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.194480, lagrangian_loss: 0.005873, attention_score_distillation_loss: 0.000000 loss: 0.309483, lagrangian_loss: -0.005302, attention_score_distillation_loss: 0.000000 ETA: 0:53:53 | Epoch 56 finished. Took 106.21 seconds. 
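At inference, a per-layer keep ratio has to translate into dropping concrete tokens. Given that this run ranks tokens by attention-derived importance, a per-layer step could look like the following; this is a generic sketch, not this repository's actual forward pass, and attn_scores is a hypothetical per-token importance input.

```python
import torch

def prune_tokens(hidden: torch.Tensor,       # [batch, tokens, dim]
                 attn_scores: torch.Tensor,  # [batch, tokens] importance
                 keep_ratio: float) -> torch.Tensor:
    # Keep the top-k tokens by importance score, preserving their order.
    k = max(1, int(round(hidden.size(1) * keep_ratio)))
    idx = attn_scores.topk(k, dim=1).indices.sort(dim=1).values
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```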
---------------------------------------------------------------------- time: 2023-07-19 15:46:50 Evaluating: matthews_correlation: 0.5789, eval_loss: 0.6212, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15300 lambda_1: -1.2858, lambda_2: 106.0177 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.347308, lagrangian_loss: -0.003561, attention_score_distillation_loss: 0.000000 loss: 0.099951, lagrangian_loss: -0.001296, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:47:03 Evaluating: matthews_correlation: 0.5729, eval_loss: 0.6042, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15350 lambda_1: -0.4798, lambda_2: 106.7434 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.86 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.032888, lagrangian_loss: -0.000269, attention_score_distillation_loss: 0.000000 loss: 0.020188, lagrangian_loss: 0.000632, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:47:15 Evaluating: matthews_correlation: 0.5789, eval_loss: 0.613, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15400 lambda_1: -0.0021, lambda_2: 107.1237 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.287369, lagrangian_loss: 0.002434, attention_score_distillation_loss: 0.000000 loss: 0.585889, lagrangian_loss: 0.001421, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:47:28 Evaluating: matthews_correlation: 0.5748, eval_loss: 0.6117, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, 
target_sparsity: 0.43, step: 15450 lambda_1: 0.1361, lambda_2: 107.5783 lambda_3: 0.0000 train remain: [0.98 0.91 0.88 0.86 0.87 0.81 0.76 0.46 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000100000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 loss: 0.241214, lagrangian_loss: -0.000032, attention_score_distillation_loss: 0.000000 loss: 0.023155, lagrangian_loss: 0.002373, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:47:40 Evaluating: matthews_correlation: 0.5847, eval_loss: 0.6064, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15500 lambda_1: -0.4333, lambda_2: 108.1614 lambda_3: 0.0000 train remain: [0.99 0.91 0.88 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000100000 10000000000000000000 Best eval score so far: 0.5844 @ step 15100 epoch 56.34 Saving the best model so far: [Epoch 57 | Step: 15500 | MACs sparsity: 0.4674 | Score: 0.5847 | Loss: 0.6064] loss: 0.225530, lagrangian_loss: 0.002019, attention_score_distillation_loss: 0.000000 loss: 0.289899, lagrangian_loss: -0.001659, attention_score_distillation_loss: 0.000000 ETA: 0:53:01 | Epoch 57 finished. Took 106.79 seconds. 
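The "Saving the best model so far" lines are simple best-score checkpointing: a checkpoint is written whenever a new evaluation beats the running best (0.5837 → 0.5844 → 0.5847 in this stretch, with non-improving evaluations skipped). A minimal sketch of that bookkeeping, with hypothetical names:

```python
# Hypothetical bookkeeping mirroring the "Saving the best model so far" lines.
best_score = float("-inf")

def maybe_save_best(model, score, step, epoch, macs_sparsity, eval_loss):
    global best_score
    if score > best_score:  # strictly better than the running best
        best_score = score
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} "
              f"| Loss: {eval_loss}]")
        # model.save_pretrained(...)  # actual checkpoint write goes here
```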
---------------------------------------------------------------------- time: 2023-07-19 15:48:35 Evaluating: matthews_correlation: 0.5792, eval_loss: 0.6125, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15550 lambda_1: -1.4106, lambda_2: 109.0542 lambda_3: 0.0000 train remain: [0.99 0.91 0.88 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5847 @ step 15500 epoch 57.84 loss: 0.006574, lagrangian_loss: 0.018286, attention_score_distillation_loss: 0.000000 loss: 0.297958, lagrangian_loss: 0.004657, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:48:47 Evaluating: matthews_correlation: 0.5891, eval_loss: 0.6174, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15600 lambda_1: -2.1097, lambda_2: 109.7909 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000100000 00000000000000000001 Best eval score so far: 0.5847 @ step 15500 epoch 57.84 Saving the best model so far: [Epoch 58 | Step: 15600 | MACs sparsity: 0.4674 | Score: 0.5891 | Loss: 0.6174] loss: 0.045724, lagrangian_loss: 0.011001, attention_score_distillation_loss: 0.000000 loss: 0.679320, lagrangian_loss: 0.011228, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:49:42 Evaluating: matthews_correlation: 0.5699, eval_loss: 0.6092, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15650 lambda_1: -2.7253, lambda_2: 110.4102 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000100000000000 10000000000000000000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.010239, lagrangian_loss: 0.022625, attention_score_distillation_loss: 0.000000 loss: 0.268935, lagrangian_loss: -0.008078, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:49:54 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6167, token_prune_loc: [False, True, True, True, True, True, 
True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15700 lambda_1: -2.6883, lambda_2: 110.8527 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000100000000000 00000000010000000000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.071818, lagrangian_loss: -0.000209, attention_score_distillation_loss: 0.000000 loss: 0.034466, lagrangian_loss: -0.007845, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:50:07 Evaluating: matthews_correlation: 0.584, eval_loss: 0.6113, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15750 lambda_1: -2.1597, lambda_2: 111.2907 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000100000000000 00000000010000000000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.037866, lagrangian_loss: -0.007255, attention_score_distillation_loss: 0.000000 loss: 0.085364, lagrangian_loss: -0.002781, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:50:19 Evaluating: matthews_correlation: 0.5873, eval_loss: 0.6127, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15800 lambda_1: -1.4793, lambda_2: 111.8941 lambda_3: 0.0000 train remain: [0.98 0.91 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000001000 00000000000000100000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.085183, lagrangian_loss: 0.013498, attention_score_distillation_loss: 0.000000 ETA: 0:52:11 | Epoch 58 finished. Took 112.4 seconds. 
loss: 0.067741, lagrangian_loss: 0.004863, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:50:31 Evaluating: matthews_correlation: 0.5794, eval_loss: 0.6144, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15850 lambda_1: -1.1392, lambda_2: 112.5177 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000000000100000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.302972, lagrangian_loss: -0.002006, attention_score_distillation_loss: 0.000000 loss: 0.040477, lagrangian_loss: -0.002019, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:50:44 Evaluating: matthews_correlation: 0.5807, eval_loss: 0.6081, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15900 lambda_1: -1.0658, lambda_2: 112.9066 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 loss: 0.249456, lagrangian_loss: 0.000540, attention_score_distillation_loss: 0.000000 loss: 0.115081, lagrangian_loss: 0.001003, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:50:56 Evaluating: matthews_correlation: 0.5939, eval_loss: 0.6183, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 15950 lambda_1: -1.2065, lambda_2: 113.3733 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 00000010000000000000 Best eval score so far: 0.5891 @ step 15600 epoch 58.21 Saving the best model so far: [Epoch 59 | Step: 15950 | MACs sparsity: 0.4674 | Score: 0.5939 | Loss: 0.6183] loss: 0.064554, lagrangian_loss: -0.001192, attention_score_distillation_loss: 0.000000 loss: 0.013521, lagrangian_loss: -0.000567, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:51:43 Evaluating: 
matthews_correlation: 0.5943, eval_loss: 0.6151, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16000 lambda_1: -1.0416, lambda_2: 113.7834 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010001000000000000 00000010000000000000 Best eval score so far: 0.5939 @ step 15950 epoch 59.51 Saving the best model so far: [Epoch 59 | Step: 16000 | MACs sparsity: 0.4674 | Score: 0.5943 | Loss: 0.6151] loss: 0.076587, lagrangian_loss: 0.002769, attention_score_distillation_loss: 0.000000 loss: 0.243381, lagrangian_loss: -0.001508, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:52:36 Evaluating: matthews_correlation: 0.5828, eval_loss: 0.6225, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16050 lambda_1: -1.0270, lambda_2: 114.3109 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.012855, lagrangian_loss: -0.000840, attention_score_distillation_loss: 0.000000 loss: 0.068456, lagrangian_loss: -0.001491, attention_score_distillation_loss: 0.000000 ETA: 0:51:36 | Epoch 59 finished. Took 139.35 seconds. 
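The fractional epoch numbers attached to each step are consistent with CoLA's 8,551 training sentences at batch size 32, i.e. 268 optimizer steps per epoch:

```python
import math

steps_per_epoch = math.ceil(8551 / 32)  # CoLA train size / batch size = 268
for step in (13650, 15100, 16000):
    print(step, round(step / steps_per_epoch, 2))
# 13650 50.93 | 15100 56.34 | 16000 59.7 -- matching the epochs in the log
```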
---------------------------------------------------------------------- time: 2023-07-19 15:52:48 Evaluating: matthews_correlation: 0.5873, eval_loss: 0.6216, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16100 lambda_1: -0.9224, lambda_2: 114.7280 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.040979, lagrangian_loss: 0.010921, attention_score_distillation_loss: 0.000000 loss: 0.039682, lagrangian_loss: 0.003623, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:53:01 Evaluating: matthews_correlation: 0.5734, eval_loss: 0.6279, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16150 lambda_1: -1.3078, lambda_2: 115.2474 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.059988, lagrangian_loss: 0.005492, attention_score_distillation_loss: 0.000000 loss: 0.031632, lagrangian_loss: 0.002723, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:53:13 Evaluating: matthews_correlation: 0.5828, eval_loss: 0.6274, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16200 lambda_1: -1.5497, lambda_2: 115.7280 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.307581, lagrangian_loss: -0.000772, attention_score_distillation_loss: 0.000000 loss: 0.045541, lagrangian_loss: -0.004682, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:53:26 Evaluating: matthews_correlation: 0.5785, eval_loss: 0.6209, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, 
target_sparsity: 0.43, step: 16250 lambda_1: -1.7653, lambda_2: 116.1559 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.023601, lagrangian_loss: 0.000857, attention_score_distillation_loss: 0.000000 loss: 0.020285, lagrangian_loss: 0.000466, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:53:39 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6233, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16300 lambda_1: -1.8242, lambda_2: 116.6115 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.019895, lagrangian_loss: -0.004704, attention_score_distillation_loss: 0.000000 loss: 0.024746, lagrangian_loss: -0.004045, attention_score_distillation_loss: 0.000000 ETA: 0:50:11 | Epoch 60 finished. Took 65.43 seconds. 
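Because the evaluation blocks are regular, the whole log is easy to tabulate, e.g. to plot score against sparsity. A small parser sketch (field names taken from the log itself; it assumes every block carries all four fields, so the early evaluations that omit the sparsity fields would need special-casing):

```python
import re

EVAL_RE = re.compile(
    r"matthews_correlation: (?P<mcc>[\d.]+), eval_loss: (?P<loss>[\d.]+)"
    r".*?expected_sparsity: (?P<sparsity>[\d.]+).*?step: (?P<step>\d+)",
    re.DOTALL,
)

def eval_rows(log_text: str):
    # Yields (step, matthews_correlation, eval_loss, expected_sparsity).
    for m in EVAL_RE.finditer(log_text):
        yield (int(m["step"]), float(m["mcc"]),
               float(m["loss"]), float(m["sparsity"]))
```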
---------------------------------------------------------------------- time: 2023-07-19 15:53:51 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.6213, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.44, expected_sequence_sparsity: 0.9002, target_sparsity: 0.43, step: 16350 lambda_1: -1.8648, lambda_2: 117.0016 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.87 0.81 0.76 0.45 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.15, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.016175, lagrangian_loss: -0.000615, attention_score_distillation_loss: 0.000000 loss: 0.073867, lagrangian_loss: 0.000478, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:54:04 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6195, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16400 lambda_1: -1.5148, lambda_2: 117.4955 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.76 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.020380, lagrangian_loss: -0.004021, attention_score_distillation_loss: 0.000000 loss: 0.023523, lagrangian_loss: -0.002052, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:54:17 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6192, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16450 lambda_1: -0.6438, lambda_2: 118.4249 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.76 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.136541, lagrangian_loss: -0.000854, attention_score_distillation_loss: 0.000000 loss: 0.326093, lagrangian_loss: 0.000053, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:54:29 Evaluating: matthews_correlation: 0.5898, eval_loss: 0.6013, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 
0.9035, target_sparsity: 0.43, step: 16500 lambda_1: 0.0738, lambda_2: 119.3493 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.15 0.06] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.030290, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000000 loss: 0.040511, lagrangian_loss: 0.000474, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:54:42 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6121, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16550 lambda_1: 0.0809, lambda_2: 119.7463 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.76 0.46 0.15 0.06] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.126577, lagrangian_loss: 0.000123, attention_score_distillation_loss: 0.000000 loss: 0.269463, lagrangian_loss: 0.000799, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:54:54 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6301, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16600 lambda_1: -0.7097, lambda_2: 120.6673 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.385721, lagrangian_loss: 0.000602, attention_score_distillation_loss: 0.000000 ETA: 0:48:50 | Epoch 61 finished. Took 70.82 seconds. 
loss: 0.260110, lagrangian_loss: 0.017323, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:55:07 Evaluating: matthews_correlation: 0.5774, eval_loss: 0.609, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16650 lambda_1: -1.6772, lambda_2: 121.6601 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.032990, lagrangian_loss: -0.000481, attention_score_distillation_loss: 0.000000 loss: 0.027180, lagrangian_loss: -0.004711, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:55:19 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6082, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16700 lambda_1: -1.5452, lambda_2: 122.1366 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 loss: 0.282466, lagrangian_loss: -0.004219, attention_score_distillation_loss: 0.000000 loss: 0.098816, lagrangian_loss: -0.001602, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:55:31 Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6038, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16750 lambda_1: -1.1049, lambda_2: 122.6687 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5943 @ step 16000 epoch 59.70 Saving the best model so far: [Epoch 62 | Step: 16750 | MACs sparsity: 0.4825 | Score: 0.5946 | Loss: 0.6038] loss: 0.028805, lagrangian_loss: -0.001976, attention_score_distillation_loss: 0.000000 loss: 0.005501, lagrangian_loss: 0.006385, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:56:18 Evaluating: 
matthews_correlation: 0.5847, eval_loss: 0.6059, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16800 lambda_1: -0.9357, lambda_2: 123.0818 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.251819, lagrangian_loss: -0.000951, attention_score_distillation_loss: 0.000000 loss: 0.045185, lagrangian_loss: 0.005017, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:56:30 Evaluating: matthews_correlation: 0.5799, eval_loss: 0.6101, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16850 lambda_1: -1.2263, lambda_2: 123.5564 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.87 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.017242, lagrangian_loss: -0.002589, attention_score_distillation_loss: 0.000000 loss: 0.149810, lagrangian_loss: 0.001113, attention_score_distillation_loss: 0.000000 ETA: 0:47:46 | Epoch 62 finished. Took 98.62 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 15:56:43 Evaluating: matthews_correlation: 0.5721, eval_loss: 0.6148, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16900 lambda_1: -0.9960, lambda_2: 124.0380 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.020501, lagrangian_loss: -0.000015, attention_score_distillation_loss: 0.000000 loss: 0.033485, lagrangian_loss: 0.000326, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:56:55 Evaluating: matthews_correlation: 0.5851, eval_loss: 0.6084, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 16950 lambda_1: -0.7689, lambda_2: 124.6901 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.015303, lagrangian_loss: 0.001569, attention_score_distillation_loss: 0.000000 loss: 0.009850, lagrangian_loss: 0.002814, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:57:08 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6162, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17000 lambda_1: -0.9722, lambda_2: 125.0147 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.86 0.88 0.81 0.76 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010001000000000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.007601, lagrangian_loss: 0.004125, attention_score_distillation_loss: 0.000000 loss: 0.135024, lagrangian_loss: -0.002384, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:57:21 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6181, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, 
expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17050 lambda_1: -1.4211, lambda_2: 125.4809 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.88 0.81 0.76 0.46 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.027369, lagrangian_loss: 0.004533, attention_score_distillation_loss: 0.000000 loss: 0.008380, lagrangian_loss: -0.000827, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:57:33 Evaluating: matthews_correlation: 0.5908, eval_loss: 0.6223, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17100 lambda_1: -1.8615, lambda_2: 125.9534 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.88 0.81 0.76 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000000001 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.158719, lagrangian_loss: -0.002629, attention_score_distillation_loss: 0.000000 loss: 0.029461, lagrangian_loss: -0.005411, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:57:46 Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6124, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17150 lambda_1: -1.9011, lambda_2: 126.4684 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.86 0.88 0.81 0.76 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.145117, lagrangian_loss: -0.006732, attention_score_distillation_loss: 0.000000 ETA: 0:46:25 | Epoch 63 finished. Took 71.17 seconds. 
loss: 0.082681, lagrangian_loss: -0.003258, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:57:58 Evaluating: matthews_correlation: 0.584, eval_loss: 0.62, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17200 lambda_1: -1.3005, lambda_2: 127.0820 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.86 0.88 0.81 0.76 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000010000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.047669, lagrangian_loss: -0.002227, attention_score_distillation_loss: 0.000000 loss: 0.289601, lagrangian_loss: -0.002077, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:58:11 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6258, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17250 lambda_1: -0.8342, lambda_2: 127.6250 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000001000 00000000010000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.046223, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.000000 loss: 0.118819, lagrangian_loss: 0.000023, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:58:24 Evaluating: matthews_correlation: 0.5902, eval_loss: 0.6139, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17300 lambda_1: -0.4562, lambda_2: 128.1041 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000001000 10000000000000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.052177, lagrangian_loss: 0.001017, attention_score_distillation_loss: 0.000000 loss: 0.007227, lagrangian_loss: -0.000291, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:58:36 Evaluating: matthews_correlation: 0.5844, eval_loss: 0.624, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17350 lambda_1: -0.4343, lambda_2: 128.5357 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.110712, lagrangian_loss: 0.000831, attention_score_distillation_loss: 0.000000 loss: 0.031267, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:58:48 Evaluating: matthews_correlation: 0.5908, eval_loss: 0.6321, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17400 lambda_1: -0.3312, lambda_2: 129.1461 lambda_3: 0.0000 train remain: [0.97 0.92 0.86 0.86 0.88 0.81 0.77 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.100276, lagrangian_loss: 0.001527, attention_score_distillation_loss: 0.000000 ETA: 0:45:01 | Epoch 64 finished. Took 65.05 seconds. 
loss: 0.182858, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:59:01 Evaluating: matthews_correlation: 0.5933, eval_loss: 0.6212, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17450 lambda_1: -0.4286, lambda_2: 129.6483 lambda_3: 0.0000 train remain: [0.97 0.92 0.86 0.86 0.88 0.81 0.77 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 loss: 0.013458, lagrangian_loss: -0.000276, attention_score_distillation_loss: 0.000000 loss: 0.215934, lagrangian_loss: 0.002020, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 15:59:13 Evaluating: matthews_correlation: 0.5972, eval_loss: 0.6138, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17500 lambda_1: -0.8528, lambda_2: 130.2421 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.89 0.81 0.77 0.45 0.14 0.06] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5946 @ step 16750 epoch 62.50 Saving the best model so far: [Epoch 65 | Step: 17500 | MACs sparsity: 0.4825 | Score: 0.5972 | Loss: 0.6138] loss: 0.214754, lagrangian_loss: 0.000160, attention_score_distillation_loss: 0.000000 loss: 0.015219, lagrangian_loss: 0.004378, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:00:00 Evaluating: matthews_correlation: 0.5914, eval_loss: 0.6219, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17550 lambda_1: -1.5550, lambda_2: 130.8717 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.028741, lagrangian_loss: 0.009999, attention_score_distillation_loss: 0.000000 loss: 0.013551, lagrangian_loss: -0.000511, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:00:12 Evaluating: 
matthews_correlation: 0.5914, eval_loss: 0.6144, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17600 lambda_1: -1.7509, lambda_2: 131.3561 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.44 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010100000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.019040, lagrangian_loss: 0.001655, attention_score_distillation_loss: 0.000000 loss: 0.007933, lagrangian_loss: -0.002664, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:00:25 Evaluating: matthews_correlation: 0.5805, eval_loss: 0.6217, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17650 lambda_1: -1.2108, lambda_2: 132.0335 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.44 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.058539, lagrangian_loss: 0.003889, attention_score_distillation_loss: 0.000000 loss: 0.055770, lagrangian_loss: -0.002155, attention_score_distillation_loss: 0.000000 ETA: 0:43:55 | Epoch 65 finished. Took 99.12 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:00:37 Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6197, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17700 lambda_1: -0.8794, lambda_2: 132.5392 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.88 0.81 0.77 0.44 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.043152, lagrangian_loss: -0.001318, attention_score_distillation_loss: 0.000000 loss: 0.137762, lagrangian_loss: 0.000649, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:00:50 Evaluating: matthews_correlation: 0.5805, eval_loss: 0.6253, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17750 lambda_1: -0.1585, lambda_2: 133.2557 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.87 0.88 0.81 0.77 0.44 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.100136, lagrangian_loss: 0.001115, attention_score_distillation_loss: 0.000000 loss: 0.019832, lagrangian_loss: 0.000426, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:01:03 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6226, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17800 lambda_1: 0.0442, lambda_2: 133.6905 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.87 0.89 0.81 0.77 0.45 0.15 0.06] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.206264, lagrangian_loss: 0.001219, attention_score_distillation_loss: 0.000000 loss: 0.176315, lagrangian_loss: 0.000198, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:01:15 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6194, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 
0.9035, target_sparsity: 0.43, step: 17850 lambda_1: -0.1246, lambda_2: 134.2656 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.87 0.89 0.81 0.77 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.218079, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000000 loss: 0.068332, lagrangian_loss: 0.003615, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:01:28 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6215, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17900 lambda_1: -0.2635, lambda_2: 134.6875 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.89 0.81 0.77 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.302385, lagrangian_loss: -0.000067, attention_score_distillation_loss: 0.000000 loss: 0.487027, lagrangian_loss: 0.003288, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:01:40 Evaluating: matthews_correlation: 0.5833, eval_loss: 0.6133, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 17950 lambda_1: -0.0458, lambda_2: 135.1821 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.89 0.81 0.78 0.46 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.008387, lagrangian_loss: 0.001508, attention_score_distillation_loss: 0.000000 ETA: 0:42:34 | Epoch 66 finished. Took 71.03 seconds. 
loss: 0.464199, lagrangian_loss: 0.002950, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:01:53 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6305, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18000 lambda_1: -0.3461, lambda_2: 135.7387 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.89 0.81 0.78 0.46 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.022773, lagrangian_loss: 0.001524, attention_score_distillation_loss: 0.000000 loss: 0.123446, lagrangian_loss: -0.000282, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:02:06 Evaluating: matthews_correlation: 0.5752, eval_loss: 0.6262, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18050 lambda_1: -0.7186, lambda_2: 136.1633 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.89 0.81 0.78 0.46 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 00000010000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.308788, lagrangian_loss: 0.000561, attention_score_distillation_loss: 0.000000 loss: 0.027942, lagrangian_loss: -0.001141, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:02:18 Evaluating: matthews_correlation: 0.5774, eval_loss: 0.6234, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18100 lambda_1: -1.1787, lambda_2: 136.7928 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.88 0.81 0.79 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000100000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.019404, lagrangian_loss: 0.010068, attention_score_distillation_loss: 0.000000 loss: 0.150056, lagrangian_loss: 0.002925, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:02:32 Evaluating: matthews_correlation: 0.5774, eval_loss: 0.6245, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18150 lambda_1: -1.1970, lambda_2: 137.4827 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.89 0.81 0.78 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.055779, lagrangian_loss: 0.001382, attention_score_distillation_loss: 0.000000 loss: 0.243857, lagrangian_loss: -0.001747, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:02:45 Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6141, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18200 lambda_1: -0.8095, lambda_2: 137.9532 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.87 0.9 0.81 0.78 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.143262, lagrangian_loss: -0.000862, attention_score_distillation_loss: 0.000000 ETA: 0:41:12 | Epoch 67 finished. Took 66.75 seconds. 
loss: 0.039562, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:02:57 Evaluating: matthews_correlation: 0.5869, eval_loss: 0.6119, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18250 lambda_1: -0.6298, lambda_2: 138.4939 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.9 0.81 0.79 0.45 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.071829, lagrangian_loss: 0.001558, attention_score_distillation_loss: 0.000000 loss: 0.051116, lagrangian_loss: 0.000804, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:03:11 Evaluating: matthews_correlation: 0.5799, eval_loss: 0.6147, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18300 lambda_1: -1.2292, lambda_2: 139.1686 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.89 0.81 0.78 0.45 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.343276, lagrangian_loss: -0.002135, attention_score_distillation_loss: 0.000000 loss: 0.101283, lagrangian_loss: 0.000566, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:03:24 Evaluating: matthews_correlation: 0.5799, eval_loss: 0.613, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18350 lambda_1: -1.5445, lambda_2: 139.6728 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.89 0.81 0.78 0.44 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.031067, lagrangian_loss: -0.001880, attention_score_distillation_loss: 0.000000 loss: 0.072856, lagrangian_loss: -0.003220, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:03:37 Evaluating: matthews_correlation: 0.5779, eval_loss: 0.6335, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18400 lambda_1: -1.8721, lambda_2: 140.3829 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.78 0.44 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000000010000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.107043, lagrangian_loss: 0.006781, attention_score_distillation_loss: 0.000000 loss: 0.085546, lagrangian_loss: -0.005529, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:03:51 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6275, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4581, expected_sequence_sparsity: 0.9035, target_sparsity: 0.43, step: 18450 lambda_1: -1.6715, lambda_2: 140.7876 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.78 0.43 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.45, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.14, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110010000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.043976, lagrangian_loss: -0.000785, attention_score_distillation_loss: 0.000000 loss: 0.290037, lagrangian_loss: -0.004044, attention_score_distillation_loss: 0.000000 ETA: 0:39:51 | Epoch 68 finished. Took 68.96 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:04:04 Evaluating: matthews_correlation: 0.576, eval_loss: 0.6184, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18500 lambda_1: -1.1926, lambda_2: 141.4610 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.78 0.43 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010010000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.158316, lagrangian_loss: 0.000616, attention_score_distillation_loss: 0.000000 loss: 0.053201, lagrangian_loss: 0.000434, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:04:17 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.6142, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18550 lambda_1: -0.5118, lambda_2: 142.1111 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.78 0.43 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010001000000000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.015071, lagrangian_loss: -0.000081, attention_score_distillation_loss: 0.000000 loss: 0.270784, lagrangian_loss: -0.000087, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:04:29 Evaluating: matthews_correlation: 0.592, eval_loss: 0.6141, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18600 lambda_1: -0.1921, lambda_2: 142.5061 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.78 0.43 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000100000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.034681, lagrangian_loss: 0.002141, attention_score_distillation_loss: 0.000000 loss: 0.225519, lagrangian_loss: 0.000369, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:04:42 Evaluating: matthews_correlation: 0.5895, eval_loss: 0.6112, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 
0.9038, target_sparsity: 0.43, step: 18650 lambda_1: -0.2867, lambda_2: 142.8586 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.82 0.78 0.43 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.089817, lagrangian_loss: 0.000663, attention_score_distillation_loss: 0.000000 loss: 0.336388, lagrangian_loss: 0.000899, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:04:55 Evaluating: matthews_correlation: 0.594, eval_loss: 0.5994, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18700 lambda_1: -0.8422, lambda_2: 143.4154 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.82 0.78 0.42 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 loss: 0.196701, lagrangian_loss: -0.000222, attention_score_distillation_loss: 0.000000 loss: 0.060894, lagrangian_loss: -0.001842, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:05:08 Evaluating: matthews_correlation: 0.6029, eval_loss: 0.605, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18750 lambda_1: -1.5398, lambda_2: 144.2241 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.82 0.78 0.41 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000000001 10000000000000000000 Best eval score so far: 0.5972 @ step 17500 epoch 65.30 Saving the best model so far: [Epoch 69 | Step: 18750 | MACs sparsity: 0.4825 | Score: 0.6029 | Loss: 0.605] loss: 0.034667, lagrangian_loss: 0.012651, attention_score_distillation_loss: 0.000000 ETA: 0:38:49 | Epoch 69 finished. Took 114.13 seconds. 
loss: 0.021792, lagrangian_loss: 0.000380, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:06:02 Evaluating: matthews_correlation: 0.5946, eval_loss: 0.6099, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18800 lambda_1: -1.8825, lambda_2: 144.6591 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.82 0.78 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.359637, lagrangian_loss: 0.002128, attention_score_distillation_loss: 0.000000 loss: 0.009962, lagrangian_loss: -0.004438, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:06:15 Evaluating: matthews_correlation: 0.592, eval_loss: 0.6044, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18850 lambda_1: -1.3922, lambda_2: 145.2894 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.108011, lagrangian_loss: -0.000501, attention_score_distillation_loss: 0.000000 loss: 0.031701, lagrangian_loss: 0.005860, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:06:28 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6099, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18900 lambda_1: -1.1743, lambda_2: 145.8660 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.243038, lagrangian_loss: -0.002066, attention_score_distillation_loss: 0.000000 loss: 0.020455, lagrangian_loss: -0.001039, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:06:40 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6154, token_prune_loc: [True, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 18950 lambda_1: -0.9493, lambda_2: 146.3147 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.081042, lagrangian_loss: -0.000890, attention_score_distillation_loss: 0.000000 loss: 0.026200, lagrangian_loss: -0.000830, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:06:53 Evaluating: matthews_correlation: 0.5876, eval_loss: 0.6183, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19000 lambda_1: -0.8276, lambda_2: 146.7684 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.109584, lagrangian_loss: -0.001167, attention_score_distillation_loss: 0.000000 loss: 0.145464, lagrangian_loss: -0.000285, attention_score_distillation_loss: 0.000000 ETA: 0:37:27 | Epoch 70 finished. Took 65.81 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:07:06 Evaluating: matthews_correlation: 0.58, eval_loss: 0.6295, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19050 lambda_1: -0.5303, lambda_2: 147.4018 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.066431, lagrangian_loss: -0.000264, attention_score_distillation_loss: 0.000000 loss: 0.259460, lagrangian_loss: -0.000119, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:07:18 Evaluating: matthews_correlation: 0.5797, eval_loss: 0.6263, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19100 lambda_1: -0.4480, lambda_2: 147.8946 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.89 0.88 0.81 0.78 0.4 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000010000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.264423, lagrangian_loss: 0.010264, attention_score_distillation_loss: 0.000000 loss: 0.030619, lagrangian_loss: 0.002346, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:07:31 Evaluating: matthews_correlation: 0.5902, eval_loss: 0.6231, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19150 lambda_1: -0.7567, lambda_2: 148.4641 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.77 0.4 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000010000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.195537, lagrangian_loss: -0.000593, attention_score_distillation_loss: 0.000000 loss: 0.111787, lagrangian_loss: -0.000396, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:07:44 Evaluating: matthews_correlation: 0.6007, eval_loss: 0.6121, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 
0.9038, target_sparsity: 0.43, step: 19200 lambda_1: -0.7331, lambda_2: 148.8848 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000010000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.164899, lagrangian_loss: 0.003602, attention_score_distillation_loss: 0.000000 loss: 0.085057, lagrangian_loss: -0.000485, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:07:56 Evaluating: matthews_correlation: 0.5917, eval_loss: 0.6105, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19250 lambda_1: -0.5544, lambda_2: 149.2737 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.020747, lagrangian_loss: 0.005665, attention_score_distillation_loss: 0.000000 loss: 0.042065, lagrangian_loss: 0.001821, attention_score_distillation_loss: 0.000000 ETA: 0:36:05 | Epoch 71 finished. Took 66.02 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:08:09 Evaluating: matthews_correlation: 0.5933, eval_loss: 0.6225, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19300 lambda_1: -0.5567, lambda_2: 149.8973 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.672440, lagrangian_loss: 0.009495, attention_score_distillation_loss: 0.000000 loss: 0.253537, lagrangian_loss: 0.000548, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:08:22 Evaluating: matthews_correlation: 0.595, eval_loss: 0.6225, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19350 lambda_1: -0.3434, lambda_2: 150.6752 lambda_3: 0.0000 train remain: [0.96 0.92 0.87 0.88 0.88 0.81 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000010000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.061377, lagrangian_loss: 0.002408, attention_score_distillation_loss: 0.000000 loss: 0.207584, lagrangian_loss: -0.000068, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:08:34 Evaluating: matthews_correlation: 0.5825, eval_loss: 0.6212, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19400 lambda_1: -0.4099, lambda_2: 151.0429 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.88 0.81 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.079446, lagrangian_loss: 0.007537, attention_score_distillation_loss: 0.000000 loss: 0.108640, lagrangian_loss: 0.005872, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:08:47 Evaluating: matthews_correlation: 0.5802, eval_loss: 0.6277, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 
0.9038, target_sparsity: 0.43, step: 19450 lambda_1: -0.9403, lambda_2: 151.5007 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.88 0.88 0.82 0.77 0.39 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.015447, lagrangian_loss: 0.000683, attention_score_distillation_loss: 0.000000 loss: 0.046458, lagrangian_loss: 0.001758, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:09:00 Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6252, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19500 lambda_1: -1.2909, lambda_2: 151.8805 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.88 0.88 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.202210, lagrangian_loss: -0.002294, attention_score_distillation_loss: 0.000000 loss: 0.049927, lagrangian_loss: -0.001518, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:09:13 Evaluating: matthews_correlation: 0.5863, eval_loss: 0.6203, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19550 lambda_1: -1.4977, lambda_2: 152.2561 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.88 0.88 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.019165, lagrangian_loss: 0.000122, attention_score_distillation_loss: 0.000000 ETA: 0:34:45 | Epoch 72 finished. Took 71.76 seconds. 
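lambda_2 increases almost monotonically through this section (148.88 at step 19200, above 180 by step 22300) because the multipliers are updated by gradient ascent on the same penalty the model descends. A minimal sketch of that two-optimizer setup; the choice of AdamW and its hyperparameters here are assumptions, not read from this codebase:

```python
import torch

lambda_1 = torch.zeros((), requires_grad=True)
lambda_2 = torch.zeros((), requires_grad=True)
# A separate optimizer owns the multipliers (assumed AdamW); negating the
# penalty turns the usual descent step into ascent, so the lambdas keep
# growing while the sparsity constraint remains violated.
opt = torch.optim.AdamW([lambda_1, lambda_2], lr=0.02)

gap = torch.tensor(0.03)  # batchwise expected sparsity minus target
penalty = lambda_1 * gap + lambda_2 * gap ** 2
opt.zero_grad()
(-penalty).backward()
opt.step()
print(lambda_1.item(), lambda_2.item())  # both move upward for a positive gap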
loss: 0.012359, lagrangian_loss: -0.002734, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:09:26 Evaluating: matthews_correlation: 0.5844, eval_loss: 0.6192, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19600 lambda_1: -1.3368, lambda_2: 152.6499 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.88 0.88 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.169420, lagrangian_loss: -0.001157, attention_score_distillation_loss: 0.000000 loss: 0.156428, lagrangian_loss: 0.000151, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:09:38 Evaluating: matthews_correlation: 0.5792, eval_loss: 0.6254, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19650 lambda_1: -1.6037, lambda_2: 153.1390 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.88 0.87 0.81 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.231766, lagrangian_loss: 0.008996, attention_score_distillation_loss: 0.000000 loss: 0.031362, lagrangian_loss: 0.017345, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:09:51 Evaluating: matthews_correlation: 0.5796, eval_loss: 0.6242, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 19700 lambda_1: -2.2821, lambda_2: 153.7710 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.88 0.88 0.82 0.77 0.38 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010100 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.028130, lagrangian_loss: 0.002972, attention_score_distillation_loss: 0.000000 loss: 0.030810, lagrangian_loss: 0.000295, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:10:04 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6324, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 19750 lambda_1: -2.4119, lambda_2: 154.1336 lambda_3: 0.0000 train remain: [0.96 0.93 0.87 0.87 0.88 0.82 0.77 0.38 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.059019, lagrangian_loss: 0.010351, attention_score_distillation_loss: 0.000000 loss: 0.275360, lagrangian_loss: -0.002344, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:10:16 Evaluating: matthews_correlation: 0.5786, eval_loss: 0.6256, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 19800 lambda_1: -2.0176, lambda_2: 154.5578 lambda_3: 0.0000 train remain: [0.96 0.93 0.87 0.87 0.87 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.013401, lagrangian_loss: 0.001592, attention_score_distillation_loss: 0.000000 loss: 0.017288, lagrangian_loss: 0.002294, attention_score_distillation_loss: 0.000000 ETA: 0:33:24 | Epoch 73 finished. Took 66.61 seconds. 
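Each block of ten 20-character bit strings in these eval records is one token-keep mask per pruned layer, one bit per token bin: 1 keeps the bin, 0 prunes it, so a layer's infer remain is simply its fraction of ones. A small reader, using the masks from the step-19750 record above:

```python
masks = [
    "11111111111111111110", "11111111110111111110", "11111111110111111100",
    "11111111110111111100", "11111111111111110100", "11111111110111110100",
    "11111111110101110100", "01111110000000010000", "10010000000000100000",
    "00000000000000000001",
]

def infer_remain(mask: str) -> float:
    # Fraction of the 20 token bins this layer keeps at inference time.
    return mask.count("1") / len(mask)

print([infer_remain(m) for m in masks])
# [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]  == infer remain
```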
---------------------------------------------------------------------- time: 2023-07-19 16:10:29 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.626, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 19850 lambda_1: -1.4908, lambda_2: 155.1552 lambda_3: 0.0000 train remain: [0.96 0.93 0.87 0.87 0.87 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.259658, lagrangian_loss: -0.002599, attention_score_distillation_loss: 0.000000 loss: 0.076504, lagrangian_loss: -0.000798, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:10:42 Evaluating: matthews_correlation: 0.5876, eval_loss: 0.6264, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 19900 lambda_1: -0.7243, lambda_2: 155.8492 lambda_3: 0.0000 train remain: [0.96 0.93 0.88 0.87 0.87 0.82 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.231272, lagrangian_loss: -0.000802, attention_score_distillation_loss: 0.000000 loss: 0.246323, lagrangian_loss: -0.000218, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:10:55 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6217, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 19950 lambda_1: -0.5278, lambda_2: 156.2787 lambda_3: 0.0000 train remain: [0.96 0.93 0.88 0.87 0.87 0.83 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.190996, lagrangian_loss: 0.002614, attention_score_distillation_loss: 0.000000 loss: 0.060602, lagrangian_loss: 0.007533, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:11:08 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6258, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 20000 lambda_1: -0.7533, lambda_2: 156.6445 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.83 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.033665, lagrangian_loss: 0.001583, attention_score_distillation_loss: 0.000000 loss: 0.004401, lagrangian_loss: 0.001901, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:11:21 Evaluating: matthews_correlation: 0.5927, eval_loss: 0.6175, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20050 lambda_1: -0.9253, lambda_2: 157.1127 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.218944, lagrangian_loss: -0.001245, attention_score_distillation_loss: 0.000000 loss: 0.027596, lagrangian_loss: 0.001344, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:11:34 Evaluating: matthews_correlation: 0.5959, eval_loss: 0.6102, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20100 lambda_1: -0.7559, lambda_2: 157.5514 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 ETA: 0:32:06 | Epoch 74 finished. Took 72.55 seconds. 
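layerwise remain is the running product of infer remain down the stack: a token survives to a given layer only if every earlier pruned layer kept its bin, and the first two entries stay at 1.0 because the first two layers are not pruned. A sketch that reproduces the logged vector up to two-decimal rounding:

```python
import numpy as np

infer_remain = [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]
layerwise = np.cumprod([1.0, 1.0] + infer_remain)
print(layerwise)
# ~[1.0, 1.0, 0.95, 0.855, 0.727, 0.618, 0.525, 0.420, 0.315, 0.110, 0.017, 0.001]
# cf. the logged [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0]
```

The near-zero tail is why the deepest two layers see almost no tokens at this sparsity level.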
loss: 0.172400, lagrangian_loss: 0.001819, attention_score_distillation_loss: 0.000000 loss: 0.048204, lagrangian_loss: 0.000595, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:11:47 Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6049, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20150 lambda_1: -1.2625, lambda_2: 158.1255 lambda_3: 0.0000 train remain: [0.96 0.92 0.89 0.87 0.88 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.025719, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000000 loss: 0.239117, lagrangian_loss: -0.000082, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:11:59 Evaluating: matthews_correlation: 0.584, eval_loss: 0.6224, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20200 lambda_1: -1.7374, lambda_2: 158.7160 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000001000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.007721, lagrangian_loss: 0.017660, attention_score_distillation_loss: 0.000000 loss: 0.190067, lagrangian_loss: -0.005371, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:12:12 Evaluating: matthews_correlation: 0.58, eval_loss: 0.6241, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20250 lambda_1: -1.7740, lambda_2: 159.0851 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000100000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.204213, lagrangian_loss: 0.005383, attention_score_distillation_loss: 0.000000 loss: 0.069411, lagrangian_loss: -0.001983, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:12:25 Evaluating: matthews_correlation: 0.5771, 
eval_loss: 0.6281, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20300 lambda_1: -1.4942, lambda_2: 159.4783 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.036020, lagrangian_loss: -0.000327, attention_score_distillation_loss: 0.000000 loss: 0.008325, lagrangian_loss: -0.002779, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:12:38 Evaluating: matthews_correlation: 0.576, eval_loss: 0.6229, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20350 lambda_1: -1.0803, lambda_2: 159.9285 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.009217, lagrangian_loss: -0.001820, attention_score_distillation_loss: 0.000000 ETA: 0:30:45 | Epoch 75 finished. Took 66.32 seconds. 
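train remain (e.g. 0.96, 0.92, 0.88, ...) and infer remain (0.95, 0.9, 0.85, ...) differ because training uses soft, stochastic gate expectations while evaluation binarizes each bin's gate, quantizing the keep ratio to multiples of 1/20. A sketch of that binarization; the 0.5 threshold is an assumption, and the actual rule in this codebase may differ:

```python
def binarize_gates(expected_gate_per_bin, threshold=0.5):
    # Hard keep/prune decision per token bin for inference; with 20 bins the
    # resulting keep ratio can only be a multiple of 0.05, which is why the
    # logged infer remain snaps to 0.95, 0.9, 0.85, ... while train remain
    # holds soft values such as 0.96 or 0.92.
    return [1 if g > threshold else 0 for g in expected_gate_per_bin]

print(sum(binarize_gates([0.99] * 19 + [0.2])) / 20)  # 0.95
```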
loss: 0.092276, lagrangian_loss: -0.000311, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:12:50 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6306, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20400 lambda_1: -0.4038, lambda_2: 160.7036 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.87 0.84 0.76 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.041310, lagrangian_loss: -0.000128, attention_score_distillation_loss: 0.000000 loss: 0.023584, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:13:03 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.623, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20450 lambda_1: 0.0238, lambda_2: 161.2276 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.88 0.84 0.77 0.38 0.15 0.06] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.231339, lagrangian_loss: 0.000300, attention_score_distillation_loss: 0.000000 loss: 0.018400, lagrangian_loss: 0.000709, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:13:16 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6271, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4595, expected_sequence_sparsity: 0.9038, target_sparsity: 0.43, step: 20500 lambda_1: -0.0975, lambda_2: 161.7263 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.88 0.84 0.77 0.38 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.4, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.13, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000110000 10011000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.181419, lagrangian_loss: 0.001999, attention_score_distillation_loss: 0.000000 loss: 0.010264, lagrangian_loss: 0.000395, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:13:29 Evaluating: matthews_correlation: 0.5796, eval_loss: 0.6241, token_prune_loc: [True, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20550 lambda_1: -0.5860, lambda_2: 162.2646 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.88 0.84 0.77 0.38 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.037115, lagrangian_loss: 0.004749, attention_score_distillation_loss: 0.000000 loss: 0.004303, lagrangian_loss: 0.027003, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:13:41 Evaluating: matthews_correlation: 0.5924, eval_loss: 0.6101, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20600 lambda_1: -1.0668, lambda_2: 162.7968 lambda_3: 0.0000 train remain: [0.96 0.92 0.88 0.87 0.88 0.84 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.051572, lagrangian_loss: 0.000981, attention_score_distillation_loss: 0.000000 loss: 0.281975, lagrangian_loss: 0.003311, attention_score_distillation_loss: 0.000000 ETA: 0:29:25 | Epoch 76 finished. Took 66.41 seconds. 
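matthews_correlation is CoLA's standard metric: a correlation coefficient over the binary confusion matrix, ranging over [-1, 1] and more informative than accuracy on an unbalanced task. It can be computed directly with scikit-learn:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
print(matthews_corrcoef(y_true, y_pred))  # ~0.467 for this toy example
```

The scores hovering around 0.58-0.60 in this section are therefore well above chance, which would score near 0.0.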
---------------------------------------------------------------------- time: 2023-07-19 16:13:54 Evaluating: matthews_correlation: 0.5898, eval_loss: 0.5993, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20650 lambda_1: -1.4850, lambda_2: 163.1875 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.87 0.87 0.84 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.124556, lagrangian_loss: 0.005777, attention_score_distillation_loss: 0.000000 loss: 0.060491, lagrangian_loss: 0.023167, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:14:07 Evaluating: matthews_correlation: 0.5789, eval_loss: 0.6242, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20700 lambda_1: -1.7944, lambda_2: 163.6150 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.020012, lagrangian_loss: -0.000667, attention_score_distillation_loss: 0.000000 loss: 0.131611, lagrangian_loss: 0.003008, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:14:20 Evaluating: matthews_correlation: 0.5908, eval_loss: 0.6278, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20750 lambda_1: -1.5378, lambda_2: 164.0373 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.064685, lagrangian_loss: -0.003553, attention_score_distillation_loss: 0.000000 loss: 0.120226, lagrangian_loss: -0.002368, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:14:32 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.627, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 20800 lambda_1: -0.8944, lambda_2: 164.6233 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000100000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.169607, lagrangian_loss: -0.000910, attention_score_distillation_loss: 0.000000 loss: 0.146437, lagrangian_loss: -0.000616, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:14:45 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6295, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20850 lambda_1: -0.4765, lambda_2: 165.0365 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000100000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.118940, lagrangian_loss: 0.005398, attention_score_distillation_loss: 0.000000 loss: 0.038677, lagrangian_loss: 0.000173, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:14:58 Evaluating: matthews_correlation: 0.5895, eval_loss: 0.6257, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20900 lambda_1: -0.9182, lambda_2: 165.6075 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000010000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.094110, lagrangian_loss: 0.002923, attention_score_distillation_loss: 0.000000 ETA: 0:28:07 | Epoch 77 finished. Took 71.71 seconds. 
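token_prune_loc is derivable from the same eval record: a layer is flagged True once its inference mask drops at least one bin, i.e. its infer remain is below 1.0. In this section every pruned layer qualifies, hence the all-True vectors:

```python
infer_remain = [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]
token_prune_loc = [r < 1.0 for r in infer_remain]
print(token_prune_loc)  # [True, True, True, True, True, True, True, True, True, True]
```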
loss: 0.169177, lagrangian_loss: 0.000287, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:15:10 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.6291, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 20950 lambda_1: -1.5067, lambda_2: 166.2820 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.87 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.076116, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000000 loss: 0.017879, lagrangian_loss: 0.001109, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:15:23 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6211, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21000 lambda_1: -1.3508, lambda_2: 166.7198 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.86 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.313988, lagrangian_loss: -0.002655, attention_score_distillation_loss: 0.000000 loss: 0.038071, lagrangian_loss: -0.001273, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:15:36 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6226, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21050 lambda_1: -1.0764, lambda_2: 167.2003 lambda_3: 0.0000 train remain: [0.97 0.92 0.87 0.86 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.058795, lagrangian_loss: 0.000778, attention_score_distillation_loss: 0.000000 loss: 0.059330, lagrangian_loss: -0.000906, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:15:49 Evaluating: matthews_correlation: 0.5844, eval_loss: 0.6191, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21100 lambda_1: -0.5958, lambda_2: 167.8828 lambda_3: 0.0000 train remain: [0.97 0.92 0.88 0.86 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000100000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.008489, lagrangian_loss: -0.000068, attention_score_distillation_loss: 0.000000 loss: 0.058378, lagrangian_loss: 0.000160, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:16:01 Evaluating: matthews_correlation: 0.5803, eval_loss: 0.6221, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21150 lambda_1: -0.4352, lambda_2: 168.2847 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.86 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000100000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.167472, lagrangian_loss: 0.000067, attention_score_distillation_loss: 0.000000 ETA: 0:26:48 | Epoch 78 finished. Took 65.94 seconds. 
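"Best eval score so far" has been frozen at 0.6029 (step 18750) for thousands of steps: the evaluations in this stretch oscillate between roughly 0.576 and 0.597 without setting a new maximum. The bookkeeping behind that line is just a running max; a minimal sketch:

```python
class BestTracker:
    # Tracks the best eval metric seen so far and where it occurred.
    def __init__(self):
        self.best, self.step = float("-inf"), None

    def update(self, score, step):
        if score > self.best:
            self.best, self.step = score, step
        return self.best, self.step

tracker = BestTracker()
tracker.update(0.6029, 18750)
print(tracker.update(0.5962, 22050))  # (0.6029, 18750): no new best
```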
loss: 0.007520, lagrangian_loss: 0.002906, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:16:14 Evaluating: matthews_correlation: 0.584, eval_loss: 0.6214, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21200 lambda_1: -0.8610, lambda_2: 169.0285 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.86 0.87 0.83 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.124438, lagrangian_loss: 0.002737, attention_score_distillation_loss: 0.000000 loss: 0.062092, lagrangian_loss: 0.004208, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:16:26 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.624, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21250 lambda_1: -1.1343, lambda_2: 169.4753 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.86 0.87 0.82 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.042116, lagrangian_loss: 0.000210, attention_score_distillation_loss: 0.000000 loss: 0.025438, lagrangian_loss: 0.000720, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:16:39 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6232, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21300 lambda_1: -1.0065, lambda_2: 169.8569 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.86 0.87 0.82 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.017142, lagrangian_loss: -0.000738, attention_score_distillation_loss: 0.000000 loss: 0.047604, lagrangian_loss: -0.000186, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:16:52 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6216, token_prune_loc: [True, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21350 lambda_1: -0.9444, lambda_2: 170.3288 lambda_3: 0.0000 train remain: [0.97 0.93 0.88 0.87 0.87 0.82 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.119019, lagrangian_loss: -0.000889, attention_score_distillation_loss: 0.000000 loss: 0.123836, lagrangian_loss: 0.007730, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:17:04 Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6173, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21400 lambda_1: -1.4159, lambda_2: 170.8064 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.82 0.77 0.37 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.014601, lagrangian_loss: 0.000905, attention_score_distillation_loss: 0.000000 loss: 0.095810, lagrangian_loss: -0.001970, attention_score_distillation_loss: 0.000000 ETA: 0:25:28 | Epoch 79 finished. Took 65.9 seconds. 
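The soft train remain values are expectations over stochastic relaxed gates; the standard construction for this kind of L0 pruning is the hard-concrete distribution of Louizos et al. (2018), with one learnable log-alpha per (layer, bin), i.e. a [10, 20] parameter matching the ten 20-bit masks above. A sampling sketch; the stretch limits and the 2/3 temperature are the customary defaults, assumed rather than read from this code:

```python
import torch

GAMMA, ZETA = -0.1, 1.1   # hard-concrete stretch limits (assumed defaults)
BETA = 2.0 / 3.0          # temperature (assumed)

def sample_gates(log_alpha: torch.Tensor) -> torch.Tensor:
    # Stretched hard-concrete sample, clamped to [0, 1]; its mean over bins
    # gives the soft per-layer keep ratio reported as 'train remain'.
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / BETA)
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

gates = sample_gates(torch.zeros(10, 20))
print(gates.mean(dim=1))  # per-layer keep ratios, cf. 'train remain'
```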
---------------------------------------------------------------------- time: 2023-07-19 16:17:17 Evaluating: matthews_correlation: 0.5766, eval_loss: 0.6248, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21450 lambda_1: -2.0105, lambda_2: 171.2837 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.83 0.76 0.36 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.210839, lagrangian_loss: -0.005132, attention_score_distillation_loss: 0.000000 loss: 0.040845, lagrangian_loss: 0.006219, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:17:30 Evaluating: matthews_correlation: 0.5917, eval_loss: 0.6072, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21500 lambda_1: -2.1205, lambda_2: 171.8880 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.83 0.76 0.36 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.022149, lagrangian_loss: -0.003933, attention_score_distillation_loss: 0.000000 loss: 0.126897, lagrangian_loss: -0.002897, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:17:43 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6107, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21550 lambda_1: -1.9542, lambda_2: 172.3894 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.82 0.76 0.36 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.216445, lagrangian_loss: 0.001493, attention_score_distillation_loss: 0.000000 loss: 0.124292, lagrangian_loss: 0.002374, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:17:55 Evaluating: matthews_correlation: 0.5825, eval_loss: 0.6183, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 21600 lambda_1: -1.3494, lambda_2: 173.0456 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.82 0.76 0.36 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.136794, lagrangian_loss: -0.001577, attention_score_distillation_loss: 0.000000 loss: 0.008647, lagrangian_loss: 0.000425, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:18:08 Evaluating: matthews_correlation: 0.5888, eval_loss: 0.6156, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21650 lambda_1: -0.8349, lambda_2: 173.5569 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.82 0.76 0.36 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000100000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.012690, lagrangian_loss: -0.001004, attention_score_distillation_loss: 0.000000 loss: 0.085242, lagrangian_loss: -0.000967, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:18:21 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6166, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21700 lambda_1: -0.6515, lambda_2: 174.0110 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.87 0.87 0.83 0.76 0.35 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.239935, lagrangian_loss: 0.001545, attention_score_distillation_loss: 0.000000 ETA: 0:24:11 | Epoch 80 finished. Took 71.98 seconds. 
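For post-hoc analysis it is handier to pull the (step, metric, loss) triples out of this raw text than to re-run evaluation. A small parser matching the record format above; the regular expression is written for these exact field names:

```python
import re

PAT = re.compile(
    r"matthews_correlation: ([\d.]+), eval_loss: ([\d.]+).*?step: (\d+)", re.S)

def parse_evals(log_text: str):
    # Returns [(step, matthews_correlation, eval_loss), ...] for plotting.
    return [(int(s), float(m), float(l)) for m, l, s in PAT.findall(log_text)]

sample = ("Evaluating: matthews_correlation: 0.5888, eval_loss: 0.6156, "
          "token_prune_loc: [True], macs_sparsity: 0.4825, step: 21650")
print(parse_evals(sample))  # [(21650, 0.5888, 0.6156)]
```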
loss: 0.010728, lagrangian_loss: 0.003140, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:18:34 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6192, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21750 lambda_1: -0.9610, lambda_2: 174.5330 lambda_3: 0.0000 train remain: [0.97 0.94 0.87 0.87 0.87 0.83 0.76 0.35 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.049984, lagrangian_loss: -0.001062, attention_score_distillation_loss: 0.000000 loss: 0.298746, lagrangian_loss: -0.000947, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:18:47 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.619, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21800 lambda_1: -1.6833, lambda_2: 175.2163 lambda_3: 0.0000 train remain: [0.97 0.94 0.87 0.87 0.87 0.82 0.76 0.35 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.030570, lagrangian_loss: 0.004245, attention_score_distillation_loss: 0.000000 loss: 0.159286, lagrangian_loss: 0.000563, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:19:00 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6062, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21850 lambda_1: -2.1329, lambda_2: 175.7434 lambda_3: 0.0000 train remain: [0.97 0.94 0.87 0.87 0.87 0.82 0.76 0.35 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.150960, lagrangian_loss: 0.001440, attention_score_distillation_loss: 0.000000 loss: 0.183934, lagrangian_loss: -0.003698, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:19:12 Evaluating: matthews_correlation: 0.5844, eval_loss: 0.6113, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21900 lambda_1: -1.8271, lambda_2: 176.0445 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.87 0.82 0.76 0.35 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.114251, lagrangian_loss: -0.004584, attention_score_distillation_loss: 0.000000 loss: 0.160982, lagrangian_loss: 0.000232, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:19:25 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6221, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 21950 lambda_1: -1.4147, lambda_2: 176.5463 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.022602, lagrangian_loss: 0.001206, attention_score_distillation_loss: 0.000000 loss: 0.014834, lagrangian_loss: 0.007260, attention_score_distillation_loss: 0.000000 ETA: 0:22:52 | Epoch 81 finished. Took 66.92 seconds. 
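From eval to eval in this stretch only the deepest pruned layers' masks move (single bins swap position), and expected_sparsity shifts between 0.4595 and 0.4608 exactly when the eighth pruned layer keeps 8 rather than 7 of its 20 bins (infer remain 0.40 vs 0.35). A small, purely illustrative helper for localizing such flips when reading the log:

```python
def changed_layers(prev_masks, curr_masks):
    # Indices of layers whose keep/prune bit patterns differ between two evals.
    return [i for i, (a, b) in enumerate(zip(prev_masks, curr_masks)) if a != b]

prev = ["10010010000000000000", "00000000000000000001"]  # last two layers, step 21950
curr = ["10010010000000000000", "10000000000000000000"]  # last two layers, step 22000
print(changed_layers(prev, curr))  # [1] -> only the final layer's mask moved
```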
---------------------------------------------------------------------- time: 2023-07-19 16:19:38 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.6169, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22000 lambda_1: -1.2472, lambda_2: 177.1526 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.050783, lagrangian_loss: 0.007709, attention_score_distillation_loss: 0.000000 loss: 0.137196, lagrangian_loss: 0.002630, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:19:51 Evaluating: matthews_correlation: 0.5962, eval_loss: 0.6161, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22050 lambda_1: -0.9095, lambda_2: 177.6341 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.030667, lagrangian_loss: -0.000408, attention_score_distillation_loss: 0.000000 loss: 0.014665, lagrangian_loss: 0.000661, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:20:04 Evaluating: matthews_correlation: 0.5888, eval_loss: 0.623, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22100 lambda_1: -0.6983, lambda_2: 178.0552 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000001000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.012963, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000000 loss: 0.066155, lagrangian_loss: -0.000090, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:20:17 Evaluating: matthews_correlation: 0.5965, eval_loss: 0.618, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 22150 lambda_1: -0.5488, lambda_2: 178.5298 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.156191, lagrangian_loss: 0.004718, attention_score_distillation_loss: 0.000000 loss: 0.029455, lagrangian_loss: 0.003727, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:20:30 Evaluating: matthews_correlation: 0.5939, eval_loss: 0.6166, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22200 lambda_1: -0.3241, lambda_2: 179.1601 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.88 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.556574, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000000 loss: 0.421217, lagrangian_loss: 0.000602, attention_score_distillation_loss: 0.000000 ETA: 0:21:34 | Epoch 82 finished. Took 67.54 seconds. 
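Editor's note: the "layerwise remain" vector is a running product of the per-layer "infer remain" ratios, with the first two entries fixed at 1.0 (the first two layers are not pruned). A quick sanity check against the block above:

    import numpy as np

    infer_remain = [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]
    layerwise = [1.0, 1.0] + list(np.cumprod(infer_remain))

    logged = [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0]
    # Matches the logged vector to within its two-decimal display rounding.
    assert all(abs(a - b) < 0.01 for a, b in zip(layerwise, logged))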
---------------------------------------------------------------------- time: 2023-07-19 16:20:43 Evaluating: matthews_correlation: 0.595, eval_loss: 0.623, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22250 lambda_1: -0.1996, lambda_2: 179.5799 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.88 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.025678, lagrangian_loss: 0.000445, attention_score_distillation_loss: 0.000000 loss: 0.189345, lagrangian_loss: 0.002198, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:20:56 Evaluating: matthews_correlation: 0.5879, eval_loss: 0.625, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22300 lambda_1: -0.2317, lambda_2: 180.0176 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.88 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.299253, lagrangian_loss: 0.002240, attention_score_distillation_loss: 0.000000 loss: 0.019911, lagrangian_loss: 0.003314, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:21:09 Evaluating: matthews_correlation: 0.5905, eval_loss: 0.6244, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22350 lambda_1: -0.8609, lambda_2: 180.8837 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.89 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.042423, lagrangian_loss: 0.001753, attention_score_distillation_loss: 0.000000 loss: 0.029951, lagrangian_loss: 0.000824, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:21:22 Evaluating: matthews_correlation: 0.5846, eval_loss: 0.6321, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 22400 lambda_1: -1.6509, lambda_2: 181.6889 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.097265, lagrangian_loss: 0.001265, attention_score_distillation_loss: 0.000000 loss: 0.024130, lagrangian_loss: -0.003422, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:21:35 Evaluating: matthews_correlation: 0.5888, eval_loss: 0.6285, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22450 lambda_1: -2.1182, lambda_2: 182.3775 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.049159, lagrangian_loss: 0.004117, attention_score_distillation_loss: 0.000000 loss: 0.061096, lagrangian_loss: 0.003898, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:21:47 Evaluating: matthews_correlation: 0.5863, eval_loss: 0.6194, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22500 lambda_1: -2.2230, lambda_2: 182.7886 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.039482, lagrangian_loss: -0.006138, attention_score_distillation_loss: 0.000000 ETA: 0:20:17 | Epoch 83 finished. Took 72.62 seconds. 
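Editor's note: each 20-character bit string above is a per-layer keep/drop mask over bin_num = 20 token bins, one string per pruned layer, and its fraction of ones is exactly that layer's "infer remain" entry. Decoding a few masks from the step 22500 block:

    masks = {
        "infer remain 0.95": "11111111111111111110",  # 19 ones / 20 bins
        "infer remain 0.35": "01111110000000010000",  #  7 ones / 20 bins
        "infer remain 0.15": "10011000000000000000",  #  3 ones / 20 bins
        "infer remain 0.05": "00000000000000000001",  #  1 one  / 20 bins
    }
    for name, m in masks.items():
        print(name, "->", m.count("1") / len(m))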
loss: 0.220004, lagrangian_loss: -0.001000, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:22:00 Evaluating: matthews_correlation: 0.593, eval_loss: 0.6285, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22550 lambda_1: -1.7726, lambda_2: 183.2902 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.242995, lagrangian_loss: -0.000830, attention_score_distillation_loss: 0.000000 loss: 0.080609, lagrangian_loss: -0.003003, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:22:13 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6297, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22600 lambda_1: -1.1563, lambda_2: 183.9532 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.015018, lagrangian_loss: 0.004818, attention_score_distillation_loss: 0.000000 loss: 0.229260, lagrangian_loss: -0.000502, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:22:26 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.6327, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22650 lambda_1: -0.4860, lambda_2: 184.7347 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.88 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000100000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.573281, lagrangian_loss: -0.000304, attention_score_distillation_loss: 0.000000 loss: 0.030112, lagrangian_loss: 0.008021, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:22:38 Evaluating: matthews_correlation: 0.5933, eval_loss: 0.6215, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22700 lambda_1: -0.8402, lambda_2: 185.4338 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.051533, lagrangian_loss: 0.003525, attention_score_distillation_loss: 0.000000 loss: 0.050956, lagrangian_loss: 0.000247, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:22:51 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.6276, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22750 lambda_1: -1.2099, lambda_2: 185.9099 lambda_3: 0.0000 train remain: [0.97 0.94 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.191008, lagrangian_loss: 0.001891, attention_score_distillation_loss: 0.000000 loss: 0.041092, lagrangian_loss: -0.001956, attention_score_distillation_loss: 0.000000 ETA: 0:19:00 | Epoch 84 finished. Took 66.15 seconds. 
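Editor's note: matthews_correlation is the standard GLUE metric for CoLA. For reference, it can be computed directly with scikit-learn; this is a generic illustration with hypothetical toy labels, not this script's evaluation path:

    from sklearn.metrics import matthews_corrcoef

    # Hypothetical 0/1 acceptability labels and model predictions.
    labels = [1, 1, 0, 1, 0, 0, 1, 1]
    preds  = [1, 0, 0, 1, 0, 1, 1, 1]
    print(matthews_corrcoef(labels, preds))  # 1.0 perfect, ~0.0 chance, -1.0 inverted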
---------------------------------------------------------------------- time: 2023-07-19 16:23:04 Evaluating: matthews_correlation: 0.5911, eval_loss: 0.6157, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22800 lambda_1: -1.2990, lambda_2: 186.3170 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.233546, lagrangian_loss: -0.001856, attention_score_distillation_loss: 0.000000 loss: 0.023553, lagrangian_loss: 0.008391, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:23:16 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.6145, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22850 lambda_1: -1.5640, lambda_2: 186.7458 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.052950, lagrangian_loss: -0.001523, attention_score_distillation_loss: 0.000000 loss: 0.119407, lagrangian_loss: -0.000445, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:23:29 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6218, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22900 lambda_1: -1.1876, lambda_2: 187.2977 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.025239, lagrangian_loss: -0.001493, attention_score_distillation_loss: 0.000000 loss: 0.051760, lagrangian_loss: -0.000877, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:23:42 Evaluating: matthews_correlation: 0.5786, eval_loss: 0.6232, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, 
expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 22950 lambda_1: -0.6601, lambda_2: 187.8782 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.094439, lagrangian_loss: -0.000532, attention_score_distillation_loss: 0.000000 loss: 0.214106, lagrangian_loss: 0.000101, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:23:54 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6247, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23000 lambda_1: -0.4005, lambda_2: 188.3206 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.218071, lagrangian_loss: -0.000211, attention_score_distillation_loss: 0.000000 loss: 0.054240, lagrangian_loss: -0.000051, attention_score_distillation_loss: 0.000000 ETA: 0:17:42 | Epoch 85 finished. Took 65.96 seconds. 
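Editor's note: the ETA lines shrink by roughly one epoch's wall time per epoch (~66-72 s here). A plausible reconstruction of the countdown, using a hypothetical format_eta helper and assuming the run is configured for 100 epochs:

    from datetime import timedelta

    def format_eta(epochs_done: int, total_epochs: int, avg_epoch_seconds: float) -> str:
        remaining = (total_epochs - epochs_done) * avg_epoch_seconds
        return str(timedelta(seconds=int(remaining)))

    # 85 of 100 epochs done; 70.8 s/epoch is an assumed average.
    print(format_eta(85, 100, 70.8))  # -> roughly the 0:17:42 printed above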
---------------------------------------------------------------------- time: 2023-07-19 16:24:07 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.6302, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23050 lambda_1: -0.1087, lambda_2: 188.7380 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.225837, lagrangian_loss: 0.000287, attention_score_distillation_loss: 0.000000 loss: 0.465706, lagrangian_loss: 0.004806, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:24:20 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.634, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23100 lambda_1: -0.3415, lambda_2: 189.1549 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.289799, lagrangian_loss: -0.000151, attention_score_distillation_loss: 0.000000 loss: 0.019288, lagrangian_loss: -0.000259, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:24:32 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.63, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23150 lambda_1: -0.8483, lambda_2: 189.8616 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.022326, lagrangian_loss: -0.000844, attention_score_distillation_loss: 0.000000 loss: 0.018309, lagrangian_loss: 0.000578, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:24:45 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.6324, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 
0.904, target_sparsity: 0.43, step: 23200 lambda_1: -0.9044, lambda_2: 190.2441 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.81 0.76 0.33 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.009390, lagrangian_loss: -0.000652, attention_score_distillation_loss: 0.000000 loss: 0.048259, lagrangian_loss: 0.006516, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:24:58 Evaluating: matthews_correlation: 0.584, eval_loss: 0.6242, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23250 lambda_1: -0.6382, lambda_2: 190.8030 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.81 0.76 0.33 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.026084, lagrangian_loss: 0.000084, attention_score_distillation_loss: 0.000000 loss: 0.449701, lagrangian_loss: -0.000265, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:25:10 Evaluating: matthews_correlation: 0.5885, eval_loss: 0.6197, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23300 lambda_1: -0.5406, lambda_2: 191.1336 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000011000 10010010000000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.197908, lagrangian_loss: -0.000363, attention_score_distillation_loss: 0.000000 ETA: 0:16:25 | Epoch 86 finished. Took 71.22 seconds. 
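Editor's note: "train remain" reports soft (expected) keep ratios, while "infer remain" lives on the 1/20 grid implied by bin_num = 20. The hard values come from thresholding per-bin gates, so snapping the soft mean to the grid only approximates them, usually to within one bin (e.g. the 0.93 above maps to 0.9 in the log, not 0.95). A hypothetical illustration of the grid itself:

    BIN = 1 / 20  # bin_num = 20

    def to_bin_grid(soft_remain: float) -> float:
        # Nearest-bin snap; the real masks come from per-bin gate decisions,
        # so this only approximates the logged "infer remain" values.
        return round(round(soft_remain / BIN) * BIN, 2)

    print([to_bin_grid(x) for x in [0.97, 0.86, 0.76, 0.34, 0.14, 0.05]])
    # [0.95, 0.85, 0.75, 0.35, 0.15, 0.05]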
loss: 0.100796, lagrangian_loss: -0.000656, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:25:23 Evaluating: matthews_correlation: 0.5818, eval_loss: 0.6232, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23350 lambda_1: -0.9033, lambda_2: 191.5916 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.89 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.074704, lagrangian_loss: -0.000499, attention_score_distillation_loss: 0.000000 loss: 0.065042, lagrangian_loss: -0.001085, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:25:36 Evaluating: matthews_correlation: 0.5895, eval_loss: 0.6138, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23400 lambda_1: -0.9597, lambda_2: 192.1056 lambda_3: 0.0000 train remain: [0.97 0.93 0.86 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.096253, lagrangian_loss: -0.000793, attention_score_distillation_loss: 0.000000 loss: 0.263042, lagrangian_loss: 0.003988, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:25:48 Evaluating: matthews_correlation: 0.5891, eval_loss: 0.6157, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23450 lambda_1: -0.6736, lambda_2: 192.6604 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.89 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010000000000100000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.047676, lagrangian_loss: 0.001411, attention_score_distillation_loss: 0.000000 loss: 0.090241, lagrangian_loss: -0.000511, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:26:01 Evaluating: matthews_correlation: 0.593, eval_loss: 0.6209, token_prune_loc: [True, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23500 lambda_1: -0.9425, lambda_2: 193.0134 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.89 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.015395, lagrangian_loss: 0.002663, attention_score_distillation_loss: 0.000000 loss: 0.040318, lagrangian_loss: -0.001279, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:26:14 Evaluating: matthews_correlation: 0.5837, eval_loss: 0.6287, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4825, expected_sparsity: 0.4608, expected_sequence_sparsity: 0.904, target_sparsity: 0.43, step: 23550 lambda_1: -1.1020, lambda_2: 193.4788 lambda_3: 0.0000 train remain: [0.97 0.93 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [0.95, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 0.95, 0.86, 0.73, 0.62, 0.53, 0.42, 0.32, 0.11, 0.02, 0.0] 11111111111111111110 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.141314, lagrangian_loss: 0.001930, attention_score_distillation_loss: 0.000000 loss: 0.015742, lagrangian_loss: 0.002273, attention_score_distillation_loss: 0.000000 ETA: 0:15:08 | Epoch 87 finished. Took 65.94 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:26:26 Evaluating: matthews_correlation: 0.5734, eval_loss: 0.6349, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23600 lambda_1: -1.3363, lambda_2: 194.0670 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010010000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.392692, lagrangian_loss: -0.002299, attention_score_distillation_loss: 0.000000 loss: 0.032521, lagrangian_loss: -0.001527, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:26:39 Evaluating: matthews_correlation: 0.5754, eval_loss: 0.6405, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23650 lambda_1: -1.2154, lambda_2: 194.4242 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000000000000001 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.184830, lagrangian_loss: -0.001881, attention_score_distillation_loss: 0.000000 loss: 0.010754, lagrangian_loss: -0.000396, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:26:52 Evaluating: matthews_correlation: 0.5805, eval_loss: 0.642, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23700 lambda_1: -1.6542, lambda_2: 194.9469 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010001000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.117678, lagrangian_loss: -0.003238, attention_score_distillation_loss: 0.000000 loss: 0.014898, lagrangian_loss: 0.001270, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:27:04 Evaluating: matthews_correlation: 0.5805, eval_loss: 0.6376, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 23750 lambda_1: -1.9312, lambda_2: 195.5536 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010001000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.020420, lagrangian_loss: 0.006656, attention_score_distillation_loss: 0.000000 loss: 0.032340, lagrangian_loss: 0.003902, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:27:17 Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6447, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23800 lambda_1: -1.9368, lambda_2: 195.9369 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.011829, lagrangian_loss: 0.000748, attention_score_distillation_loss: 0.000000 loss: 0.060067, lagrangian_loss: 0.002231, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:27:29 Evaluating: matthews_correlation: 0.5811, eval_loss: 0.6268, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23850 lambda_1: -2.2181, lambda_2: 196.5110 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010010000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.028840, lagrangian_loss: 0.014347, attention_score_distillation_loss: 0.000000 ETA: 0:13:52 | Epoch 88 finished. Took 71.08 seconds. 
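Editor's note: from step 23600 onward the first pruned layer's token_prune_loc flips to False (its infer remain returns to 1.0), and macs_sparsity drops from 0.4825 to 0.4674 accordingly; meanwhile "Best eval score so far" stays pinned at 0.6029 @ step 18750, i.e. none of these later checkpoints improves on it. The tracker behind that line is presumably no more than the following illustrative sketch:

    best_score, best_step, best_epoch = float("-inf"), None, None

    def update_best(score: float, step: int, epoch: float) -> None:
        global best_score, best_step, best_epoch
        if score > best_score:
            best_score, best_step, best_epoch = score, step, epoch
        print(f"Best eval score so far: {best_score:.4f} @ step {best_step} epoch {best_epoch}")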
loss: 0.135645, lagrangian_loss: -0.005433, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:27:42 Evaluating: matthews_correlation: 0.5782, eval_loss: 0.6346, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23900 lambda_1: -2.0728, lambda_2: 197.1075 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.029313, lagrangian_loss: 0.002943, attention_score_distillation_loss: 0.000000 loss: 0.162682, lagrangian_loss: -0.003863, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:27:54 Evaluating: matthews_correlation: 0.5777, eval_loss: 0.6358, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 23950 lambda_1: -1.7293, lambda_2: 197.7333 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.87 0.87 0.83 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10011000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.036246, lagrangian_loss: 0.008222, attention_score_distillation_loss: 0.000000 loss: 0.045857, lagrangian_loss: -0.002212, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:28:07 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.64, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24000 lambda_1: -1.4609, lambda_2: 198.4162 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10010000000000000001 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.035892, lagrangian_loss: 0.001833, attention_score_distillation_loss: 0.000000 loss: 0.037132, lagrangian_loss: 0.003350, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:28:20 Evaluating: matthews_correlation: 0.5828, eval_loss: 0.6397, token_prune_loc: [False, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24050 lambda_1: -1.4780, lambda_2: 198.8125 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000001000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.076435, lagrangian_loss: -0.002566, attention_score_distillation_loss: 0.000000 loss: 0.014102, lagrangian_loss: -0.002069, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:28:32 Evaluating: matthews_correlation: 0.5828, eval_loss: 0.6337, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24100 lambda_1: -1.1882, lambda_2: 199.3862 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.236032, lagrangian_loss: -0.001457, attention_score_distillation_loss: 0.000000 ETA: 0:12:35 | Epoch 89 finished. Took 65.63 seconds. 
loss: 0.033546, lagrangian_loss: -0.001967, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:28:45 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6351, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24150 lambda_1: -1.4627, lambda_2: 200.1011 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.89 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000001000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.014651, lagrangian_loss: 0.007806, attention_score_distillation_loss: 0.000000 loss: 0.049407, lagrangian_loss: -0.001225, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:28:58 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6407, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24200 lambda_1: -1.4530, lambda_2: 200.5483 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.89 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000001 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.398319, lagrangian_loss: 0.003076, attention_score_distillation_loss: 0.000000 loss: 0.086597, lagrangian_loss: -0.002415, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:29:10 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6363, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24250 lambda_1: -1.5506, lambda_2: 201.1649 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.88 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10010000010000000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.006712, lagrangian_loss: 0.014017, attention_score_distillation_loss: 0.000000 loss: 0.079372, lagrangian_loss: 0.004253, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:29:23 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6336, token_prune_loc: [False, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24300 lambda_1: -1.6542, lambda_2: 201.8755 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.88 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10010000000000000001 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.231646, lagrangian_loss: -0.001547, attention_score_distillation_loss: 0.000000 loss: 0.028874, lagrangian_loss: -0.002752, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:29:36 Evaluating: matthews_correlation: 0.576, eval_loss: 0.6356, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24350 lambda_1: -1.3743, lambda_2: 202.5701 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000110000 10010000000001000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.014626, lagrangian_loss: 0.005035, attention_score_distillation_loss: 0.000000 loss: 0.115948, lagrangian_loss: -0.000890, attention_score_distillation_loss: 0.000000 ETA: 0:11:19 | Epoch 90 finished. Took 65.98 seconds. 
---------------------------------------------------------------------- time: 2023-07-19 16:29:48 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6402, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24400 lambda_1: -0.9518, lambda_2: 203.1413 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 00000000000000000001 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.029692, lagrangian_loss: -0.000781, attention_score_distillation_loss: 0.000000 loss: 0.119554, lagrangian_loss: -0.000224, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:30:01 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6366, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24450 lambda_1: -0.3620, lambda_2: 203.7373 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.046058, lagrangian_loss: -0.000011, attention_score_distillation_loss: 0.000000 loss: 0.012105, lagrangian_loss: 0.000787, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:30:14 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6392, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24500 lambda_1: -0.0666, lambda_2: 204.3164 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.018035, lagrangian_loss: 0.000653, attention_score_distillation_loss: 0.000000 loss: 0.032226, lagrangian_loss: 0.000601, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:30:26 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6413, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 24550 lambda_1: -0.0737, lambda_2: 204.5701 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.020228, lagrangian_loss: 0.001437, attention_score_distillation_loss: 0.000000 loss: 0.041353, lagrangian_loss: 0.001946, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:30:39 Evaluating: matthews_correlation: 0.5851, eval_loss: 0.6423, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24600 lambda_1: -0.3001, lambda_2: 205.1508 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.086059, lagrangian_loss: 0.002406, attention_score_distillation_loss: 0.000000 loss: 0.036773, lagrangian_loss: 0.002456, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:30:52 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6405, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24650 lambda_1: -0.2836, lambda_2: 205.6676 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.071862, lagrangian_loss: 0.000029, attention_score_distillation_loss: 0.000000 ETA: 0:10:03 | Epoch 91 finished. Took 71.21 seconds. 
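Editor's note: these eval blocks are regular enough to scrape for post-hoc plotting (e.g. matthews_correlation vs. step). A small hypothetical parser over the raw log text; the regex group names mirror the log fields above:

    import re

    EVAL_RE = re.compile(
        r"matthews_correlation: (?P<mcc>[\d.]+), eval_loss: (?P<loss>[\d.]+)"
        r".*?step: (?P<step>\d+)",
        re.S,
    )

    def parse_eval_blocks(log_text: str):
        """Yield (step, matthews_correlation, eval_loss) per Evaluating block."""
        for m in EVAL_RE.finditer(log_text):
            yield int(m["step"]), float(m["mcc"]), float(m["loss"])

The non-greedy .*? stops at the first "step:" after each metrics pair, which in these blocks is the evaluation step itself.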
loss: 0.030332, lagrangian_loss: 0.003315, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:31:04 Evaluating: matthews_correlation: 0.5718, eval_loss: 0.6417, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24700 lambda_1: -0.3838, lambda_2: 206.1997 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010001000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.027727, lagrangian_loss: 0.000402, attention_score_distillation_loss: 0.000000 loss: 0.058851, lagrangian_loss: -0.000153, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:31:17 Evaluating: matthews_correlation: 0.5667, eval_loss: 0.6343, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24750 lambda_1: -0.4612, lambda_2: 206.6900 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.89 0.87 0.82 0.76 0.34 0.15 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000001010000 10010001000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.145557, lagrangian_loss: 0.001983, attention_score_distillation_loss: 0.000000 loss: 0.013255, lagrangian_loss: 0.000840, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:31:30 Evaluating: matthews_correlation: 0.5821, eval_loss: 0.6304, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24800 lambda_1: -0.6382, lambda_2: 207.1483 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.400824, lagrangian_loss: 0.001468, attention_score_distillation_loss: 0.000000 loss: 0.009981, lagrangian_loss: 0.010629, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:31:42 Evaluating: matthews_correlation: 0.5731, eval_loss: 0.6316, token_prune_loc: [False, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24850 lambda_1: -0.9240, lambda_2: 207.7007 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.011006, lagrangian_loss: 0.000719, attention_score_distillation_loss: 0.000000 loss: 0.068790, lagrangian_loss: 0.006677, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:31:55 Evaluating: matthews_correlation: 0.5737, eval_loss: 0.6349, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24900 lambda_1: -0.7381, lambda_2: 208.1998 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000011000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.034744, lagrangian_loss: -0.000554, attention_score_distillation_loss: 0.000000 ETA: 0:08:47 | Epoch 92 finished. Took 65.88 seconds. 
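The ten 20-character bit rows printed with each eval appear to be the per-location token-bin masks (bin_num=20): '1' marks a kept bin, '0' a dropped one, and the density of ones in each row reproduces the corresponding "infer remain" entry. A small check using the rows from the step-24900 record above (the interpretation is an assumption; the strings are copied from the log):

# Fraction of kept bins per prune location, rows from step 24900.
masks = [
    "11111111111111111111",
    "11111111110111111110",
    "11111111110111111100",
    "11111111110111111100",
    "11111111111111110100",
    "11111111110111110100",
    "11111111110101110100",
    "01111100000000011000",
    "10010000000001000000",
    "00000000010000000000",
]
print([row.count("1") / len(row) for row in masks])
# -> [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]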
loss: 0.045899, lagrangian_loss: 0.000708, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:32:08 Evaluating: matthews_correlation: 0.5741, eval_loss: 0.627, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 24950 lambda_1: -0.4978, lambda_2: 208.6094 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000011000 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.043897, lagrangian_loss: 0.000052, attention_score_distillation_loss: 0.000000 loss: 0.019491, lagrangian_loss: 0.000706, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:32:20 Evaluating: matthews_correlation: 0.5811, eval_loss: 0.6424, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25000 lambda_1: -0.0742, lambda_2: 209.1804 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.15 0.06] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000000001000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.041249, lagrangian_loss: 0.000266, attention_score_distillation_loss: 0.000000 loss: 0.019549, lagrangian_loss: 0.000111, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:32:33 Evaluating: matthews_correlation: 0.576, eval_loss: 0.6293, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25050 lambda_1: -0.1230, lambda_2: 209.5367 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.15 0.06] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100010000010000 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.237647, lagrangian_loss: 0.000511, attention_score_distillation_loss: 0.000000 loss: 0.102454, lagrangian_loss: 0.000469, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:32:46 Evaluating: matthews_correlation: 0.5715, eval_loss: 0.638, token_prune_loc: [False, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25100 lambda_1: -0.4564, lambda_2: 210.2525 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.06] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100010000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.010125, lagrangian_loss: -0.000235, attention_score_distillation_loss: 0.000000 loss: 0.079447, lagrangian_loss: -0.000088, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:32:58 Evaluating: matthews_correlation: 0.5851, eval_loss: 0.641, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25150 lambda_1: -1.4248, lambda_2: 211.4067 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.007298, lagrangian_loss: 0.007927, attention_score_distillation_loss: 0.000000 loss: 0.246645, lagrangian_loss: -0.001950, attention_score_distillation_loss: 0.000000 ETA: 0:07:31 | Epoch 93 finished. Took 65.7 seconds. 
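The small, sign-flipping "lagrangian_loss" together with the two trained multipliers is consistent with the Lagrangian sparsity constraint used in CoFi-style L0 pruning; the sketch below shows that form as an assumption about this code, not a quote from it. The multipliers are optimized adversarially (ascent on the same term), so the penalty can go negative when sparsity overshoots the target.

# Assumed CoFi-style penalty tying expected sparsity to the target sparsity.
def lagrangian_penalty(expected_sparsity, target_sparsity, lambda_1, lambda_2):
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# With the eval-time values from step 25150 above (gap = 0.4429 - 0.43 = 0.0129)
# this gives ~0.0168; the per-step values in the log are computed from the
# train-time expected sparsity, which fluctuates batch to batch, so they differ.
print(lagrangian_penalty(0.4429, 0.43, -1.4248, 211.4067))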
---------------------------------------------------------------------- time: 2023-07-19 16:33:11 Evaluating: matthews_correlation: 0.5914, eval_loss: 0.6323, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25200 lambda_1: -1.4539, lambda_2: 211.9668 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.052082, lagrangian_loss: -0.002111, attention_score_distillation_loss: 0.000000 loss: 0.053456, lagrangian_loss: -0.001442, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:33:23 Evaluating: matthews_correlation: 0.5914, eval_loss: 0.6304, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25250 lambda_1: -0.9848, lambda_2: 212.5184 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.019640, lagrangian_loss: -0.001140, attention_score_distillation_loss: 0.000000 loss: 0.129574, lagrangian_loss: 0.001305, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:33:36 Evaluating: matthews_correlation: 0.5856, eval_loss: 0.6315, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25300 lambda_1: -0.7048, lambda_2: 213.2812 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10011000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.029036, lagrangian_loss: -0.000582, attention_score_distillation_loss: 0.000000 loss: 0.051957, lagrangian_loss: 0.001028, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:33:49 Evaluating: matthews_correlation: 0.5876, eval_loss: 0.6318, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 25350 lambda_1: -0.6725, lambda_2: 213.7044 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10011000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.064709, lagrangian_loss: 0.005053, attention_score_distillation_loss: 0.000000 loss: 0.452314, lagrangian_loss: 0.010452, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:34:01 Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6362, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25400 lambda_1: -0.8724, lambda_2: 214.1629 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.156337, lagrangian_loss: 0.000681, attention_score_distillation_loss: 0.000000 loss: 0.008305, lagrangian_loss: 0.010296, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:34:14 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6295, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25450 lambda_1: -1.1026, lambda_2: 214.5025 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.88 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010010000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.013835, lagrangian_loss: 0.001153, attention_score_distillation_loss: 0.000000 ETA: 0:06:15 | Epoch 94 finished. Took 71.27 seconds. 
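The "matthews_correlation" reported at each eval is the standard GLUE metric for CoLA: a correlation coefficient over the binary acceptability confusion matrix, 0.0 at chance level and 1.0 at perfect agreement, which is why the 0.57-0.59 scores in this stretch all sit below the 0.6029 best. A toy computation with scikit-learn (labels are illustrative only):

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 1, 0, 1, 1, 0]  # gold acceptability labels (toy)
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # model predictions (toy)
print(matthews_corrcoef(y_true, y_pred))  # ~0.47 on this toy pair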
loss: 0.063304, lagrangian_loss: -0.001348, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:34:26 Evaluating: matthews_correlation: 0.5811, eval_loss: 0.6323, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25500 lambda_1: -1.1075, lambda_2: 214.9722 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.88 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111101000000010000 10010000000001000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.028111, lagrangian_loss: -0.000372, attention_score_distillation_loss: 0.000000 loss: 0.183757, lagrangian_loss: -0.001322, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:34:39 Evaluating: matthews_correlation: 0.5786, eval_loss: 0.6359, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25550 lambda_1: -1.2260, lambda_2: 215.3933 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000000000010000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.141873, lagrangian_loss: 0.001284, attention_score_distillation_loss: 0.000000 loss: 0.046828, lagrangian_loss: -0.001190, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:34:52 Evaluating: matthews_correlation: 0.5956, eval_loss: 0.6239, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25600 lambda_1: -1.0535, lambda_2: 215.8171 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 11111100000000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.006704, lagrangian_loss: 0.002156, attention_score_distillation_loss: 0.000000 loss: 0.050089, lagrangian_loss: -0.000679, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:35:04 Evaluating: matthews_correlation: 0.576, eval_loss: 0.6341, token_prune_loc: [False, True, True, True, True, True, True, True, 
True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25650 lambda_1: -0.6736, lambda_2: 216.3524 lambda_3: 0.0000 train remain: [0.98 0.92 0.86 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.032370, lagrangian_loss: 0.006293, attention_score_distillation_loss: 0.000000 loss: 0.083333, lagrangian_loss: 0.000032, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:35:17 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6371, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25700 lambda_1: -0.3465, lambda_2: 217.0663 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000000100000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.373497, lagrangian_loss: 0.002804, attention_score_distillation_loss: 0.000000 loss: 0.107474, lagrangian_loss: 0.003479, attention_score_distillation_loss: 0.000000 ETA: 0:05:00 | Epoch 95 finished. Took 65.86 seconds. 
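The ETA printed at each epoch boundary is consistent with simple remaining-epochs bookkeeping: epochs are 0-indexed with 100 training epochs in total, so after "Epoch 95 finished" four epochs remain at roughly 66-72 s of training plus eval overhead each. A back-of-envelope check; the exact averaging scheme is an assumption:

import datetime

num_train_epochs = 100
finished_epoch = 95                 # "Epoch 95 finished" above
avg_epoch_seconds = 75.0            # assumed smoothed epoch + eval time
remaining = num_train_epochs - (finished_epoch + 1)
print(datetime.timedelta(seconds=remaining * avg_epoch_seconds))
# -> 0:05:00, matching "ETA: 0:05:00" in the log line above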
---------------------------------------------------------------------- time: 2023-07-19 16:35:30 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6363, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25750 lambda_1: -0.4295, lambda_2: 217.4894 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.182603, lagrangian_loss: 0.002725, attention_score_distillation_loss: 0.000000 loss: 0.628847, lagrangian_loss: -0.000159, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:35:42 Evaluating: matthews_correlation: 0.5782, eval_loss: 0.6358, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25800 lambda_1: -0.8900, lambda_2: 218.0517 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.87 0.82 0.76 0.34 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000110000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.025491, lagrangian_loss: 0.001046, attention_score_distillation_loss: 0.000000 loss: 0.237596, lagrangian_loss: 0.006351, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:35:55 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6373, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25850 lambda_1: -1.1285, lambda_2: 218.4740 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.023916, lagrangian_loss: -0.000196, attention_score_distillation_loss: 0.000000 loss: 0.008388, lagrangian_loss: -0.000733, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:36:08 Evaluating: matthews_correlation: 0.5715, eval_loss: 0.6327, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 25900 lambda_1: -0.8742, lambda_2: 219.1641 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100010000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.029571, lagrangian_loss: -0.000481, attention_score_distillation_loss: 0.000000 loss: 0.067189, lagrangian_loss: -0.001050, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:36:20 Evaluating: matthews_correlation: 0.5763, eval_loss: 0.6286, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 25950 lambda_1: -0.8978, lambda_2: 219.6616 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.135568, lagrangian_loss: 0.000899, attention_score_distillation_loss: 0.000000 loss: 0.070675, lagrangian_loss: -0.000262, attention_score_distillation_loss: 0.000000 ETA: 0:03:44 | Epoch 96 finished. Took 65.66 seconds. 
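The ten "token_prune_loc" flags line up one-to-one with the ten "infer remain" entries: a False flag marks a location that still keeps every token, which is exactly where remain equals 1.0 (throughout this stretch, only the first location). A quick consistency check on the step-25950 values above, under that assumed correspondence:

token_prune_loc = [False, True, True, True, True, True, True, True, True, True]
infer_remain = [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05]
# A location is flagged as pruning iff it keeps fewer than all of its tokens.
assert all(flag == (r < 1.0) for flag, r in zip(token_prune_loc, infer_remain))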
---------------------------------------------------------------------- time: 2023-07-19 16:36:33 Evaluating: matthews_correlation: 0.5763, eval_loss: 0.6325, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26000 lambda_1: -1.2168, lambda_2: 220.3954 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000001010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.045102, lagrangian_loss: -0.000956, attention_score_distillation_loss: 0.000000 loss: 0.104396, lagrangian_loss: -0.001945, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:36:46 Evaluating: matthews_correlation: 0.5763, eval_loss: 0.6289, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26050 lambda_1: -1.2915, lambda_2: 220.8548 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.006577, lagrangian_loss: 0.011043, attention_score_distillation_loss: 0.000000 loss: 0.151254, lagrangian_loss: 0.007850, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:36:58 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6331, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26100 lambda_1: -1.2303, lambda_2: 221.3453 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.050793, lagrangian_loss: -0.001519, attention_score_distillation_loss: 0.000000 loss: 0.044280, lagrangian_loss: -0.000805, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:37:11 Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6346, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 26150 lambda_1: -0.7737, lambda_2: 221.8944 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000010000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.145541, lagrangian_loss: 0.005178, attention_score_distillation_loss: 0.000000 loss: 0.103292, lagrangian_loss: 0.002671, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:37:24 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6292, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26200 lambda_1: -0.6916, lambda_2: 222.4830 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.032278, lagrangian_loss: 0.005178, attention_score_distillation_loss: 0.000000 loss: 0.071999, lagrangian_loss: 0.002291, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:37:36 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6293, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26250 lambda_1: -0.6305, lambda_2: 222.9000 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.14 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.335844, lagrangian_loss: 0.000515, attention_score_distillation_loss: 0.000000 ETA: 0:02:29 | Epoch 97 finished. Took 71.52 seconds. 
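Every eval block repeats "Best eval score so far: 0.6029 @ step 18750 epoch 69.96": none of the post-warmup evaluations at the ~0.44 sparsity plateau displaces the best seen near epoch 70. The bookkeeping amounts to no more than the illustrative sketch below (names assumed, not from the code):

best = {"score": 0.6029, "step": 18750, "epoch": 69.96}

def on_eval(score, step, epoch):
    # Keep the highest matthews_correlation seen so far and where it occurred.
    if score > best["score"]:
        best.update(score=score, step=step, epoch=epoch)
    print(f"Best eval score so far: {best['score']} @ step {best['step']} epoch {best['epoch']}")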
loss: 0.043639, lagrangian_loss: 0.002743, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:37:49 Evaluating: matthews_correlation: 0.5808, eval_loss: 0.6322, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26300 lambda_1: -0.6327, lambda_2: 223.5138 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.13 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100010000010000 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.019223, lagrangian_loss: 0.001703, attention_score_distillation_loss: 0.000000 loss: 0.050046, lagrangian_loss: 0.001390, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:38:02 Evaluating: matthews_correlation: 0.5866, eval_loss: 0.6277, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4429, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26350 lambda_1: -0.8121, lambda_2: 224.0178 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.13 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.15, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.02, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000010000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.022526, lagrangian_loss: 0.001381, attention_score_distillation_loss: 0.000000 loss: 0.014921, lagrangian_loss: 0.006243, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:38:15 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6324, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26400 lambda_1: -1.1669, lambda_2: 224.5358 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.13 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.542013, lagrangian_loss: -0.000852, attention_score_distillation_loss: 0.000000 loss: 0.111583, lagrangian_loss: 0.001082, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:38:27 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6327, token_prune_loc: [False, True, True, True, True, True, True, True, True, 
True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26450 lambda_1: -1.0906, lambda_2: 225.0586 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.12 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.111154, lagrangian_loss: -0.000303, attention_score_distillation_loss: 0.000000 loss: 0.144721, lagrangian_loss: 0.001419, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:38:40 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6329, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26500 lambda_1: -0.9861, lambda_2: 225.4992 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.12 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.028278, lagrangian_loss: -0.000979, attention_score_distillation_loss: 0.000000 loss: 0.612769, lagrangian_loss: 0.002134, attention_score_distillation_loss: 0.000000 ETA: 0:01:14 | Epoch 98 finished. Took 66.21 seconds. 
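At step 26400 the ninth location's deployable ratio steps from 0.15 down to 0.1 and expected_sparsity ticks from 0.4429 to 0.4433: with bin_num=20 token bins per location, the inference-time keep-ratio can only move in increments of 1/20, while the underlying "train remain" drifts continuously (0.15, then 0.14, 0.13, 0.12 across these epochs). A check that every logged infer-remain value is bin-aligned, under that assumed quantization:

bin_num = 20
infer_remain = [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05]
# Each ratio should be an integer number of kept bins out of 20.
assert all(abs(r * bin_num - round(r * bin_num)) < 1e-9 for r in infer_remain)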
---------------------------------------------------------------------- time: 2023-07-19 16:38:53 Evaluating: matthews_correlation: 0.5882, eval_loss: 0.6329, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26550 lambda_1: -1.0010, lambda_2: 225.9695 lambda_3: 0.0000 train remain: [0.98 0.92 0.87 0.87 0.87 0.82 0.76 0.33 0.12 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010100 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.048688, lagrangian_loss: -0.001015, attention_score_distillation_loss: 0.000000 loss: 0.009625, lagrangian_loss: 0.000727, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:39:05 Evaluating: matthews_correlation: 0.5859, eval_loss: 0.6354, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26600 lambda_1: -0.9018, lambda_2: 226.5951 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.87 0.82 0.76 0.33 0.12 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111110000000010000 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.089381, lagrangian_loss: 0.003946, attention_score_distillation_loss: 0.000000 loss: 0.017401, lagrangian_loss: 0.002636, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:39:18 Evaluating: matthews_correlation: 0.5834, eval_loss: 0.6327, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26650 lambda_1: -1.0658, lambda_2: 227.2134 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.86 0.82 0.76 0.33 0.11 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000000000000 10000000000000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.039286, lagrangian_loss: -0.000238, attention_score_distillation_loss: 0.000000 loss: 0.404379, lagrangian_loss: 0.003157, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:39:31 Evaluating: matthews_correlation: 0.5853, eval_loss: 0.6366, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 
0.9008, target_sparsity: 0.43, step: 26700 lambda_1: -1.0616, lambda_2: 227.8146 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.86 0.82 0.76 0.33 0.11 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000000010001 10010000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.012405, lagrangian_loss: -0.001030, attention_score_distillation_loss: 0.000000 loss: 0.088163, lagrangian_loss: 0.005277, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:39:43 Evaluating: matthews_correlation: 0.5914, eval_loss: 0.6328, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26750 lambda_1: -1.0038, lambda_2: 228.4114 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.86 0.82 0.76 0.33 0.11 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100100000010000 10010000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 loss: 0.039096, lagrangian_loss: 0.003347, attention_score_distillation_loss: 0.000000 loss: 0.063798, lagrangian_loss: -0.000523, attention_score_distillation_loss: 0.000000 ---------------------------------------------------------------------- time: 2023-07-19 16:39:56 Evaluating: matthews_correlation: 0.5831, eval_loss: 0.6388, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4674, expected_sparsity: 0.4433, expected_sequence_sparsity: 0.9008, target_sparsity: 0.43, step: 26800 lambda_1: -0.5746, lambda_2: 228.9871 lambda_3: 0.0000 train remain: [0.98 0.93 0.87 0.87 0.86 0.82 0.76 0.34 0.11 0.05] infer remain: [1.0, 0.9, 0.85, 0.85, 0.85, 0.8, 0.75, 0.35, 0.1, 0.05] layerwise remain: [1.0, 1.0, 1.0, 0.9, 0.76, 0.65, 0.55, 0.44, 0.33, 0.12, 0.01, 0.0] 11111111111111111111 11111111110111111110 11111111110111111100 11111111110111111100 11111111111111110100 11111111110111110100 11111111110101110100 01111100000100010000 10010000000000000000 00000000010000000000 Best eval score so far: 0.6029 @ step 18750 epoch 69.96 ETA: 0:00:00 | Epoch 99 finished. Took 71.14 seconds. 07/19/2023 16:42:01 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=1c3668fe-9b9b-4fee-8aa9-d0e29ec9c11e&run_id=1c3668fe-9b9b-4fee-8aa9-d0e29ec9c11e