/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:23:13 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:23:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:23:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Downloading and preparing dataset glue/mnli to /home/aiscuser/.cache/huggingface/datasets/glue/mnli/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
> Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=2000,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/MNLI/reproduce1/s0.5_lr3e-05_reglr0.01_alpha0.002_warmup10_bin50/runs/Jul19_14-23-14_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/MNLI/reproduce1/s0.5_lr3e-05_reglr0.01_alpha0.002_warmup10_bin50,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/MNLI/reproduce1/s0.5_lr3e-05_reglr0.01_alpha0.002_warmup10_bin50,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.5_lr3e-05_reglr0.01_alpha0.002_warmup10_bin50',
pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.5, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/MNLI', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.002, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20)
---------------------------------------------------------------------- time: 2023-07-19 14:26:47 Evaluating: accuracy: 0.8498, eval_loss: 0.4721, step: 0 lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000 Starting l0 regularization! temperature: 0.67, init drop rate: 0.01 token_loga shape: [10, 50] prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] NDCG TOPK= 20 loss: 0.155196, lagrangian_loss: -0.003476, attention_score_distillation_loss: 0.018882
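The "Starting l0 regularization!" line above reports the three quantities that define the token-pruning gates: a temperature of 2/3 (printed as 0.67), an initial drop rate of 0.01, and a learnable token_loga tensor of shape [10, 50], i.e. one log-alpha per prune location (layers 2-11) and per token bin (bin_num=50). The log does not show the gate code itself; below is a minimal sketch assuming the standard hard-concrete parameterization of Louizos et al. (2018) that L0-regularized pruning of this kind typically uses (all names are illustrative, and the codebase's exact formulation may differ):

```python
import math
import torch

LIMIT_L, LIMIT_R = -0.1, 1.1   # stretch interval of the hard-concrete distribution
TEMPERATURE = 2.0 / 3.0        # printed as "temperature: 0.67"
DROPRATE_INIT = 0.01           # printed as "init drop rate: 0.01"

# One log-alpha per (prune location, token bin): shape [10, 50].
init = math.log(1 - DROPRATE_INIT) - math.log(DROPRATE_INIT)
token_loga = torch.full((10, 50), init, requires_grad=True)

def sample_gates(loga: torch.Tensor) -> torch.Tensor:
    """Training path: sample stretched hard-concrete gates in [0, 1]."""
    u = torch.rand_like(loga).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (-u).log1p() + loga) / TEMPERATURE)
    return (s * (LIMIT_R - LIMIT_L) + LIMIT_L).clamp(0.0, 1.0)

def deterministic_gates(loga: torch.Tensor) -> torch.Tensor:
    """Evaluation path: no sampling; binarizing these gates yields
    0/1 masks like the 50-character rows printed at each evaluation."""
    s = torch.sigmoid(loga / TEMPERATURE)
    return (s * (LIMIT_R - LIMIT_L) + LIMIT_L).clamp(0.0, 1.0)

print(sample_gates(token_loga).shape)  # torch.Size([10, 50])
```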
---------------------------------------------------------------------- time: 2023-07-19 14:36:56 Evaluating: accuracy: 0.8391, eval_loss: 0.5229, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0178, expected_sparsity: 0.0162, expected_sequence_sparsity: 0.704, target_sparsity: 0.0081, step: 2000 lambda_1: 0.0091, lambda_2: 26.4064 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 1. 0.99 0.88] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111011111111101111111110100110 loss: 0.176654, lagrangian_loss: 0.011700, attention_score_distillation_loss: 0.018975 loss: 0.149665, lagrangian_loss: 0.005952, attention_score_distillation_loss: 0.018948
---------------------------------------------------------------------- time: 2023-07-19 14:47:02 Evaluating: accuracy: 0.8352, eval_loss: 0.5298, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0107, expected_sparsity: 0.0081, expected_sequence_sparsity: 0.7015, target_sparsity: 0.0163, step: 4000 lambda_1: 1.9224, lambda_2: 36.0220 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.99 1. 1. 1. 0.95] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111110101110 loss: 0.155903, lagrangian_loss: -0.000476, attention_score_distillation_loss: 0.018805 loss: 0.220502, lagrangian_loss: -0.007133, attention_score_distillation_loss: 0.018668
---------------------------------------------------------------------- time: 2023-07-19 14:57:11 Evaluating: accuracy: 0.8324, eval_loss: 0.5728, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0178, expected_sparsity: 0.0162, expected_sequence_sparsity: 0.704, target_sparsity: 0.0244, step: 6000 lambda_1: -0.8247, lambda_2: 40.7975 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 1. 1. 0.89] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111100100110 loss: 0.177988, lagrangian_loss: -0.002550, attention_score_distillation_loss: 0.018595 loss: 0.226307, lagrangian_loss: 0.000066, attention_score_distillation_loss: 0.018657
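The lagrangian_loss printed with every training loss ties the model's expected sparsity to a target that warms up linearly: target_sparsity is 0.0081 at step 2000, 0.0163 at step 4000, 0.0244 at step 6000, consistent with ramping to the final target of 0.5 over lagrangian_warmup_epochs=10 (about 12,272 steps per epoch at batch size 32 on MNLI). A sketch of the CoFi-style controller these numbers are consistent with; helper names are hypothetical:

```python
import torch

# Learned Lagrange multipliers, printed as lambda_1 / lambda_2 above.
lambda_1 = torch.zeros(1, requires_grad=True)
lambda_2 = torch.zeros(1, requires_grad=True)

def target_sparsity_at(step: int, warmup_steps: int, final_target: float = 0.5) -> float:
    # Linear warmup: with warmup_steps ~= 10 * 12272, this gives
    # ~0.0081 at step 2000 and ~0.0163 at step 4000, as logged.
    return final_target * min(step / warmup_steps, 1.0)

def lagrangian_loss(expected_sparsity: torch.Tensor, target: float) -> torch.Tensor:
    # Signed multipliers make negative values possible, matching the
    # negative lagrangian_loss entries in the log.
    gap = expected_sparsity - target
    return lambda_1 * gap + lambda_2 * gap.pow(2)
```

The multipliers themselves are trained adversarially (gradient ascent on this same term, typically with the separate reg_learning_rate=0.01 from the arguments), which would explain why lambda_2 climbs monotonically in the log (26.4 at step 2000 up to 157.4 by step 30000) while lambda_1 stays small and flips sign.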
---------------------------------------------------------------------- time: 2023-07-19 15:07:19 Evaluating: accuracy: 0.8349, eval_loss: 0.5403, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0249, expected_sparsity: 0.0215, expected_sequence_sparsity: 0.7056, target_sparsity: 0.0326, step: 8000 lambda_1: -0.0580, lambda_2: 42.1910 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 1. 1. 0.84] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111101111101100100110 loss: 0.763291, lagrangian_loss: -0.000020, attention_score_distillation_loss: 0.018453 loss: 0.128044, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.017690
---------------------------------------------------------------------- time: 2023-07-19 15:17:28 Evaluating: accuracy: 0.8289, eval_loss: 0.5618, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0285, expected_sparsity: 0.0269, expected_sequence_sparsity: 0.7072, target_sparsity: 0.0407, step: 10000 lambda_1: -0.4853, lambda_2: 46.2208 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 1. 1. 0.8 ] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111101101101100100100 loss: 0.168685, lagrangian_loss: 0.000665, attention_score_distillation_loss: 0.017665 loss: 0.108938, lagrangian_loss: -0.000439, attention_score_distillation_loss: 0.017357
---------------------------------------------------------------------- time: 2023-07-19 15:27:34 Evaluating: accuracy: 0.8315, eval_loss: 0.5308, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0356, expected_sparsity: 0.035, expected_sequence_sparsity: 0.7097, target_sparsity: 0.0489, step: 12000 lambda_1: 0.4534, lambda_2: 50.5742 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.98 1. 1. 1. 0.74] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.74] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111100100100100100100 loss: 0.297896, lagrangian_loss: 0.000279, attention_score_distillation_loss: 0.017658 ETA: 1 day, 16:19:30 | Epoch 0 finished. Took 3722.31 seconds.
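Each evaluation block prints ten 50-character rows, one per prune location (layers 2-11), with one character per token bin (bin_num=50); a '1' marks a kept bin. The fraction of ones in a row is exactly that layer's infer remain entry, which can be checked directly, for example against the last row of the step-12000 block just above:

```python
# Last mask row of the step-12000 evaluation (layer 11, the only
# location being pruned at that point):
row = "10111111111111111111111111111111100100100100100100"
assert len(row) == 50             # bin_num = 50
print(row.count("1") / len(row))  # 0.74 -> matches infer remain [..., 0.74]
```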
loss: 0.067071, lagrangian_loss: -0.000199, attention_score_distillation_loss: 0.017195
---------------------------------------------------------------------- time: 2023-07-19 15:37:40 Evaluating: accuracy: 0.8379, eval_loss: 0.5044, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0392, expected_sparsity: 0.0377, expected_sequence_sparsity: 0.7105, target_sparsity: 0.057, step: 14000 lambda_1: -1.3611, lambda_2: 65.0686 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 1. 1. 0.99 0.71] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.72] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111100100100100100000 loss: 0.183393, lagrangian_loss: -0.001362, attention_score_distillation_loss: 0.016639 loss: 0.174662, lagrangian_loss: -0.000096, attention_score_distillation_loss: 0.017335
---------------------------------------------------------------------- time: 2023-07-19 15:47:50 Evaluating: accuracy: 0.8314, eval_loss: 0.5561, token_prune_loc: [False, False, False, False, False, True, False, False, False, True], macs_sparsity: 0.0757, expected_sparsity: 0.068, expected_sequence_sparsity: 0.7197, target_sparsity: 0.0652, step: 16000 lambda_1: -1.0132, lambda_2: 80.4361 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.97 1. 0.99 0.98 0.68] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0, 1.0, 1.0, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.94, 0.94, 0.64] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111100000100000100000 loss: 0.314749, lagrangian_loss: 0.000974, attention_score_distillation_loss: 0.016515 loss: 0.214502, lagrangian_loss: 0.001886, attention_score_distillation_loss: 0.016925
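layerwise remain shows how token pruning compounds with depth: tokens dropped at one layer are gone for every later layer, so the entry for a layer is the running product of infer remain over all prune locations up to it (with 1.0 for the unpruned layers 0-1). Reproducing the step-16000 block above, where layer 7 keeps 0.94 and layer 11 keeps 0.68 of its input tokens (a sketch of the bookkeeping, not the project's code):

```python
import operator
from itertools import accumulate

# infer remain for prune locations [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
infer_remain = [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0, 1.0, 1.0, 0.68]

layerwise = [1.0, 1.0] + list(accumulate(infer_remain, operator.mul))
print([round(r, 2) for r in layerwise])
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.94, 0.94, 0.64]
# matches the logged layerwise remain (0.94 * 0.68 = 0.6392 ~= 0.64)
```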
---------------------------------------------------------------------- time: 2023-07-19 15:58:01 Evaluating: accuracy: 0.8352, eval_loss: 0.5323, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.0893, expected_sparsity: 0.0815, expected_sequence_sparsity: 0.7238, target_sparsity: 0.0733, step: 18000 lambda_1: -0.5515, lambda_2: 91.8561 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.96 1. 0.99 0.97 0.66] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0, 0.98, 0.96, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.92, 0.88, 0.58] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111100 00111111111111111111111111111111100000100000100000 loss: 0.332707, lagrangian_loss: -0.000568, attention_score_distillation_loss: 0.016829 loss: 0.388896, lagrangian_loss: 0.000044, attention_score_distillation_loss: 0.016486
---------------------------------------------------------------------- time: 2023-07-19 16:08:06 Evaluating: accuracy: 0.8305, eval_loss: 0.5626, token_prune_loc: [False, False, False, False, False, True, False, True, True, True], macs_sparsity: 0.0928, expected_sparsity: 0.0839, expected_sequence_sparsity: 0.7245, target_sparsity: 0.0815, step: 20000 lambda_1: -0.4748, lambda_2: 96.0222 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 0.95 1. 0.99 0.97 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 1.0, 0.98, 0.96, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.92, 0.88, 0.57] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111100 00111111111111111111111111111111100000000000100000 loss: 0.328433, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.016496 loss: 0.073589, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.016037
---------------------------------------------------------------------- time: 2023-07-19 16:18:17 Evaluating: accuracy: 0.8341, eval_loss: 0.5229, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1028, expected_sparsity: 0.0958, expected_sequence_sparsity: 0.7281, target_sparsity: 0.0896, step: 22000 lambda_1: -0.8310, lambda_2: 112.1983 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 0.95 0.97 0.99 0.96 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.96, 0.98, 0.96, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.9, 0.88, 0.85, 0.54] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111100 00111111111111111111111111111111101000000000000000 loss: 0.418267, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.015915 loss: 0.281882, lagrangian_loss: 0.001161, attention_score_distillation_loss: 0.015568
---------------------------------------------------------------------- time: 2023-07-19 16:28:29 Evaluating: accuracy: 0.8393, eval_loss: 0.5295, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1128, expected_sparsity: 0.1069, expected_sequence_sparsity: 0.7315, target_sparsity: 0.0978, step: 24000 lambda_1: 0.0601, lambda_2: 122.4728 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 0.95 0.97 0.98 0.95 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.98, 0.94, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.87, 0.81, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111101100 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111110111100 00111111111111111111111111111111100000000000000000 loss: 0.113096, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.015794 ETA: 1 day, 15:21:11 | Epoch 1 finished. Took 3734.06 seconds.
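attention_score_distillation_loss decreases steadily (about 0.0189 at step 0, about 0.0156 by the end of epoch 1): alongside the usual prediction distillation from the teacher at distillation_path, the pruner is trained to reproduce the teacher's ranking of tokens by attention, and "NDCG TOPK= 20" indicates that ranking quality over the top 20 tokens is what is tracked. The log does not show this objective; one plausible form, using distill_temp=2.0 from the arguments, with every name here an assumption:

```python
import torch
import torch.nn.functional as F

DISTILL_TEMP = 2.0  # AdditionalArguments(distill_temp=2.0)

def attn_score_distill(student_scores: torch.Tensor,
                       teacher_scores: torch.Tensor) -> torch.Tensor:
    """Temperature-softened KL between per-token importance scores
    (e.g. attention mass received by each token), shape [batch, seq].
    A sketch only; the actual loss in this codebase may differ."""
    t = F.softmax(teacher_scores / DISTILL_TEMP, dim=-1)
    s = F.log_softmax(student_scores / DISTILL_TEMP, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * DISTILL_TEMP**2

loss = attn_score_distill(torch.randn(8, 128), torch.randn(8, 128))
```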
loss: 0.126740, lagrangian_loss: 0.000766, attention_score_distillation_loss: 0.015541
---------------------------------------------------------------------- time: 2023-07-19 16:38:43 Evaluating: accuracy: 0.8408, eval_loss: 0.5302, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1128, expected_sparsity: 0.1069, expected_sequence_sparsity: 0.7315, target_sparsity: 0.1059, step: 26000 lambda_1: -0.5634, lambda_2: 139.0166 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.95 0.96 0.98 0.94 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.98, 0.94, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.87, 0.81, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111101100 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111110111100 00111111111111111111111110111111100000100000000000 loss: 0.306678, lagrangian_loss: -0.000308, attention_score_distillation_loss: 0.015029 loss: 0.255736, lagrangian_loss: -0.000572, attention_score_distillation_loss: 0.015208
---------------------------------------------------------------------- time: 2023-07-19 16:48:46 Evaluating: accuracy: 0.8376, eval_loss: 0.5366, token_prune_loc: [False, False, False, False, False, True, True, False, True, True], macs_sparsity: 0.1207, expected_sparsity: 0.1115, expected_sequence_sparsity: 0.7329, target_sparsity: 0.1141, step: 28000 lambda_1: -0.2929, lambda_2: 144.7961 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.97 0.95 0.95 0.99 0.94 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.92, 1.0, 0.92, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.86, 0.86, 0.8, 0.49] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111110101100 00111111111111111111111110111111110000000000000000 loss: 0.181628, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.015025 loss: 0.156593, lagrangian_loss: 0.000935, attention_score_distillation_loss: 0.014939
---------------------------------------------------------------------- time: 2023-07-19 16:58:58 Evaluating: accuracy: 0.8407, eval_loss: 0.523, token_prune_loc: [False, False, False, False, False, True, True, True, True, True], macs_sparsity: 0.1328, expected_sparsity: 0.1228, expected_sequence_sparsity: 0.7363, target_sparsity: 0.1222, step: 30000 lambda_1: 0.2479, lambda_2: 157.3795 lambda_3: 0.0000 train remain: [1. 1. 1. 1.
0.97 0.94 0.95 0.99 0.93 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.98, 0.92, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.83, 0.76, 0.47] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111110101100 01111111111111111111111110111111100000000000000000 loss: 0.205032, lagrangian_loss: -0.000065, attention_score_distillation_loss: 0.014612 loss: 0.167308, lagrangian_loss: 0.000275, attention_score_distillation_loss: 0.014546 ---------------------------------------------------------------------- time: 2023-07-19 17:09:07 Evaluating: accuracy: 0.8374, eval_loss: 0.5299, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1578, expected_sparsity: 0.1479, expected_sequence_sparsity: 0.7439, target_sparsity: 0.1304, step: 32000 lambda_1: 0.1703, lambda_2: 169.2462 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.96 0.93 0.94 0.99 0.92 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.92, 0.92, 0.98, 0.92, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.86, 0.8, 0.78, 0.72, 0.44] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101100 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111110101100 00111111111111111111111110111111100000100000000000 loss: 0.067647, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.014025 loss: 0.377549, lagrangian_loss: -0.000638, attention_score_distillation_loss: 0.014421 ---------------------------------------------------------------------- time: 2023-07-19 17:19:15 Evaluating: accuracy: 0.8371, eval_loss: 0.5323, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.16, expected_sparsity: 0.1504, expected_sequence_sparsity: 0.7447, target_sparsity: 0.1385, step: 34000 lambda_1: -0.1908, lambda_2: 180.0017 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.95 0.94 0.93 0.98 0.91 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.92, 0.92, 0.98, 0.9, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.86, 0.8, 0.78, 0.7, 0.44] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111101100 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111100101100 00111111111111111111111110111111100000000000100000 loss: 0.120732, lagrangian_loss: 0.000290, attention_score_distillation_loss: 0.013715 loss: 0.210080, lagrangian_loss: -0.000074, attention_score_distillation_loss: 0.013679 ---------------------------------------------------------------------- time: 2023-07-19 17:29:27 Evaluating: accuracy: 0.8335, eval_loss: 0.5465, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1721, expected_sparsity: 0.1623, expected_sequence_sparsity: 0.7483, target_sparsity: 0.1467, step: 36000 lambda_1: -0.2603, lambda_2: 192.8188 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.94 0.93 0.93 0.97 0.9 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.96, 0.9, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.78, 0.75, 0.67, 0.42] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100100 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100101100 00111111111111111111111110111111100000000001000000 loss: 0.103417, lagrangian_loss: 0.000188, attention_score_distillation_loss: 0.013585 ETA: 1 day, 14:20:16 | Epoch 2 finished. Took 3734.13 seconds. loss: 0.062568, lagrangian_loss: -0.000207, attention_score_distillation_loss: 0.013716 ---------------------------------------------------------------------- time: 2023-07-19 17:39:39 Evaluating: accuracy: 0.84, eval_loss: 0.5432, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1721, expected_sparsity: 0.1647, expected_sequence_sparsity: 0.749, target_sparsity: 0.1548, step: 38000 lambda_1: -1.0436, lambda_2: 202.2836 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.94 0.93 0.93 0.97 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.96, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.78, 0.75, 0.66, 0.41] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100100 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100100100 00111111111111111111111110111111100000000000000001 loss: 0.263070, lagrangian_loss: 0.001045, attention_score_distillation_loss: 0.013073 loss: 0.089275, lagrangian_loss: -0.000135, attention_score_distillation_loss: 0.013400 ---------------------------------------------------------------------- time: 2023-07-19 17:49:53 Evaluating: accuracy: 0.8323, eval_loss: 0.5209, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1743, expected_sparsity: 0.1671, expected_sequence_sparsity: 0.7498, target_sparsity: 0.163, step: 40000 lambda_1: -0.8880, lambda_2: 211.6548 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.93 0.92 0.93 0.97 0.87 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.96, 0.86, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.85, 0.78, 0.75, 0.64, 0.4] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100100 11111111111111111111111111111111111111111111100010 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111111110000000000000000 loss: 0.132181, lagrangian_loss: 0.001441, attention_score_distillation_loss: 0.012878 loss: 0.118463, lagrangian_loss: 0.000163, attention_score_distillation_loss: 0.012812 ---------------------------------------------------------------------- time: 2023-07-19 18:00:07 Evaluating: accuracy: 0.836, eval_loss: 0.5474, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.1864, expected_sparsity: 0.1736, expected_sequence_sparsity: 0.7517, target_sparsity: 0.1711, step: 42000 lambda_1: -0.4460, lambda_2: 222.7386 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.92 0.91 0.92 0.97 0.87 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.9, 0.92, 0.96, 0.86, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.83, 0.76, 0.73, 0.63, 0.39] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100100 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111111100000000000000100 loss: 0.134157, lagrangian_loss: 0.000129, attention_score_distillation_loss: 0.012897 loss: 0.146968, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.012697 ---------------------------------------------------------------------- time: 2023-07-19 18:10:24 Evaluating: accuracy: 0.841, eval_loss: 0.5293, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.195, expected_sparsity: 0.1879, expected_sequence_sparsity: 0.756, target_sparsity: 0.1793, step: 44000 lambda_1: -0.9139, lambda_2: 235.7542 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.91 0.89 0.93 0.97 0.87 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.88, 0.92, 0.96, 0.86, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.73, 0.7, 0.6, 0.37] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111111100000000000000001 loss: 0.185002, lagrangian_loss: -0.000637, attention_score_distillation_loss: 0.012658 loss: 0.083014, lagrangian_loss: 0.000063, attention_score_distillation_loss: 0.012448 ---------------------------------------------------------------------- time: 2023-07-19 18:20:37 Evaluating: accuracy: 0.8322, eval_loss: 0.5557, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2049, expected_sparsity: 0.1958, expected_sequence_sparsity: 0.7584, target_sparsity: 0.1874, step: 46000 lambda_1: -0.7079, lambda_2: 246.9659 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.91 0.88 0.92 0.96 0.87 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.92, 0.96, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.77, 0.71, 0.68, 0.59, 0.35] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111110000000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111111100000000000000000 loss: 0.123164, lagrangian_loss: -0.000046, attention_score_distillation_loss: 0.012146 loss: 0.262360, lagrangian_loss: -0.000987, attention_score_distillation_loss: 0.011925 ---------------------------------------------------------------------- time: 2023-07-19 18:30:51 Evaluating: accuracy: 0.8408, eval_loss: 0.5098, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2049, expected_sparsity: 0.1958, expected_sequence_sparsity: 0.7584, target_sparsity: 0.1956, step: 48000 lambda_1: -1.2826, lambda_2: 257.8509 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.91 0.87 0.92 0.96 0.86 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.92, 0.96, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.77, 0.71, 0.68, 0.59, 0.35] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111110000000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111111100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111101100000000010000000 loss: 0.100423, lagrangian_loss: -0.000768, attention_score_distillation_loss: 0.011991 loss: 0.075112, lagrangian_loss: -0.000328, attention_score_distillation_loss: 0.011465 ETA: 1 day, 13:22:07 | Epoch 3 finished. Took 3756.94 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:41:03 Evaluating: accuracy: 0.8368, eval_loss: 0.5403, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2085, expected_sparsity: 0.2005, expected_sequence_sparsity: 0.7599, target_sparsity: 0.2037, step: 50000 lambda_1: -0.6683, lambda_2: 268.4996 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 
0.9 0.87 0.92 0.94 0.86 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.86, 0.92, 0.94, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.77, 0.71, 0.67, 0.58, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111110000000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111111101100 11111111111111111111111111111111111111111100000100 00111111111111111111111110111101100000000000000000 loss: 0.051785, lagrangian_loss: -0.000389, attention_score_distillation_loss: 0.011705 loss: 0.045426, lagrangian_loss: 0.000273, attention_score_distillation_loss: 0.011251 ---------------------------------------------------------------------- time: 2023-07-19 18:51:13 Evaluating: accuracy: 0.8372, eval_loss: 0.5405, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.227, expected_sparsity: 0.2153, expected_sequence_sparsity: 0.7643, target_sparsity: 0.2119, step: 52000 lambda_1: -1.6266, lambda_2: 279.9109 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.89 0.87 0.91 0.93 0.86 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.86, 0.9, 0.92, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.76, 0.68, 0.63, 0.54, 0.31] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111111110000000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111100000100 00111111111111111111111110111101000100000000000000 loss: 0.218590, lagrangian_loss: 0.000950, attention_score_distillation_loss: 0.010928 loss: 0.114248, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.011049 ---------------------------------------------------------------------- time: 2023-07-19 19:01:26 Evaluating: accuracy: 0.8335, eval_loss: 0.5343, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.227, expected_sparsity: 0.2167, expected_sequence_sparsity: 0.7648, target_sparsity: 0.22, step: 54000 lambda_1: -0.5925, lambda_2: 291.9427 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.89 0.86 0.91 0.92 0.86 0.56] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.86, 0.9, 0.92, 0.86, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.76, 0.68, 0.63, 0.54, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111111110000000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111101000 11111111111111111111111111111111111111111100000100 00111111111111111111111110111101000000000000000000 loss: 0.154739, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.010929 loss: 0.287467, lagrangian_loss: 0.000710, attention_score_distillation_loss: 0.010370 ---------------------------------------------------------------------- time: 2023-07-19 19:11:44 Evaluating: accuracy: 0.8354, eval_loss: 0.5414, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2349, expected_sparsity: 0.2268, expected_sequence_sparsity: 0.7678, target_sparsity: 0.2282, step: 56000 lambda_1: -1.1425, lambda_2: 303.2679 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.89 0.85 0.9 0.91 0.86 0.54] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.84, 0.9, 0.9, 0.86, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.74, 0.67, 0.6, 0.51, 0.28] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111100000100 00111111111111111111111110110101000000000000000000 loss: 0.304629, lagrangian_loss: -0.000063, attention_score_distillation_loss: 0.010572 loss: 0.142589, lagrangian_loss: -0.000480, attention_score_distillation_loss: 0.010392 ---------------------------------------------------------------------- time: 2023-07-19 19:22:02 Evaluating: accuracy: 0.8352, eval_loss: 0.5502, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2434, expected_sparsity: 0.2346, expected_sequence_sparsity: 0.7702, target_sparsity: 0.2363, step: 58000 lambda_1: -2.5594, lambda_2: 315.5283 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.98 0.88 0.85 0.9 0.9 0.86 0.54] infer remain: [1.0, 1.0, 1.0, 0.98, 0.88, 0.84, 0.9, 0.9, 0.86, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.98, 0.86, 0.72, 0.65, 0.59, 0.5, 0.27] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111100000100 00111111111111111111111110100101000000000000000001 loss: 0.075688, lagrangian_loss: 0.000676, attention_score_distillation_loss: 0.010002 loss: 0.086275, lagrangian_loss: 0.001107, attention_score_distillation_loss: 0.009592 ---------------------------------------------------------------------- time: 2023-07-19 19:32:22 Evaluating: accuracy: 0.8354, eval_loss: 0.5425, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2542, expected_sparsity: 0.2425, expected_sequence_sparsity: 0.7726, target_sparsity: 0.2445, step: 60000 lambda_1: -1.6286, lambda_2: 327.6156 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.97 0.88 0.84 0.89 0.9 0.86 0.54] infer remain: [1.0, 1.0, 1.0, 0.96, 0.88, 0.84, 0.9, 0.9, 0.86, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.96, 0.84, 0.71, 0.64, 0.57, 0.49, 0.27] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111011110 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111111100000 11111111111111111111111111111111111111111100000100 00111111111111111111111110100101000000100000000000 loss: 0.096564, lagrangian_loss: -0.001080, attention_score_distillation_loss: 0.009995 loss: 0.090069, lagrangian_loss: -0.000528, attention_score_distillation_loss: 0.009846 ETA: 1 day, 12:24:55 | Epoch 4 finished. Took 3780.52 seconds. ---------------------------------------------------------------------- time: 2023-07-19 19:42:50 Evaluating: accuracy: 0.8365, eval_loss: 0.5358, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2663, expected_sparsity: 0.2584, expected_sequence_sparsity: 0.7774, target_sparsity: 0.2526, step: 62000 lambda_1: -0.4206, lambda_2: 339.1541 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.95 0.88 0.84 0.89 0.89 0.84 0.54] infer remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.84, 0.88, 0.88, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.83, 0.69, 0.61, 0.54, 0.45, 0.24] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111011010 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111101100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00111111111111111111111110100101000100000000000000 loss: 0.090838, lagrangian_loss: 0.000248, attention_score_distillation_loss: 0.009427 loss: 0.295040, lagrangian_loss: 0.000173, attention_score_distillation_loss: 0.009286 ---------------------------------------------------------------------- time: 2023-07-19 19:53:06 Evaluating: accuracy: 0.8293, eval_loss: 0.5511, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2706, expected_sparsity: 0.2621, expected_sequence_sparsity: 0.7785, target_sparsity: 0.2607, step: 64000 lambda_1: -0.5806, lambda_2: 351.3143 lambda_3: 0.0000 train remain: [1. 1. 1. 0.95 0.88 0.84 0.87 0.88 0.84 0.54] infer remain: [1.0, 1.0, 1.0, 0.94, 0.88, 0.84, 0.86, 0.88, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.94, 0.83, 0.69, 0.6, 0.53, 0.44, 0.24] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111011010 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00111111111111111111111110100101010000000000000000 loss: 0.115170, lagrangian_loss: 0.000359, attention_score_distillation_loss: 0.009317 loss: 0.069134, lagrangian_loss: -0.000489, attention_score_distillation_loss: 0.009274 ---------------------------------------------------------------------- time: 2023-07-19 20:03:28 Evaluating: accuracy: 0.8296, eval_loss: 0.5635, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2869, expected_sparsity: 0.2773, expected_sequence_sparsity: 0.7831, target_sparsity: 0.2689, step: 66000 lambda_1: -0.9358, lambda_2: 363.3222 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.92 0.88 0.84 0.87 0.88 0.84 0.54] infer remain: [1.0, 1.0, 1.0, 0.9, 0.88, 0.84, 0.86, 0.88, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.67, 0.57, 0.5, 0.42, 0.23] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111001000 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00111111111111111111111110100101000000000001000000 loss: 0.165418, lagrangian_loss: 0.000515, attention_score_distillation_loss: 0.008697 loss: 0.210169, lagrangian_loss: -0.000082, attention_score_distillation_loss: 0.008782 ---------------------------------------------------------------------- time: 2023-07-19 20:13:46 Evaluating: accuracy: 0.8307, eval_loss: 0.5531, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2955, expected_sparsity: 0.2849, expected_sequence_sparsity: 0.7854, target_sparsity: 0.277, step: 68000 lambda_1: -0.7409, lambda_2: 374.3768 lambda_3: 0.0000 train remain: [1. 1. 1. 0.91 0.88 0.84 0.87 0.88 0.84 0.54] infer remain: [1.0, 1.0, 1.0, 0.88, 0.88, 0.84, 0.86, 0.88, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.65, 0.56, 0.49, 0.41, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101001000 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00111111111111111111111110100101000000000000100000 loss: 0.126009, lagrangian_loss: -0.000366, attention_score_distillation_loss: 0.008688 loss: 0.138773, lagrangian_loss: 0.001237, attention_score_distillation_loss: 0.008460 ---------------------------------------------------------------------- time: 2023-07-19 20:24:09 Evaluating: accuracy: 0.8264, eval_loss: 0.5675, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.2955, expected_sparsity: 0.286, expected_sequence_sparsity: 0.7858, target_sparsity: 0.2852, step: 70000 lambda_1: -0.5866, lambda_2: 386.3099 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.89 0.88 0.84 0.87 0.88 0.84 0.52] infer remain: [1.0, 1.0, 1.0, 0.88, 0.88, 0.84, 0.86, 0.88, 0.84, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.65, 0.56, 0.49, 0.41, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101001000 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00111111111111111111111110100101000000000000000000 loss: 0.226809, lagrangian_loss: -0.000169, attention_score_distillation_loss: 0.008404 loss: 0.167627, lagrangian_loss: -0.000402, attention_score_distillation_loss: 0.008272 ---------------------------------------------------------------------- time: 2023-07-19 20:34:33 Evaluating: accuracy: 0.828, eval_loss: 0.5813, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3055, expected_sparsity: 0.2936, expected_sequence_sparsity: 0.7881, target_sparsity: 0.2933, step: 72000 lambda_1: -0.8020, lambda_2: 397.0881 lambda_3: 0.0000 train remain: [1. 0.99 0.99 0.87 0.88 0.84 0.87 0.88 0.84 0.52] infer remain: [1.0, 1.0, 1.0, 0.86, 0.88, 0.84, 0.86, 0.88, 0.84, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.64, 0.55, 0.48, 0.4, 0.21] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111111111000000 11111111111111111111111111111111111111110110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 11111111111111111111111111111111111111111100000000 00011111111111111111111110100101000000001000000000 loss: 0.189793, lagrangian_loss: -0.000240, attention_score_distillation_loss: 0.008135 loss: 0.231220, lagrangian_loss: -0.000339, attention_score_distillation_loss: 0.007949 ETA: 1 day, 11:27:58 | Epoch 5 finished. Took 3803.51 seconds. ---------------------------------------------------------------------- time: 2023-07-19 20:44:56 Evaluating: accuracy: 0.8275, eval_loss: 0.5657, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.314, expected_sparsity: 0.3053, expected_sequence_sparsity: 0.7916, target_sparsity: 0.3015, step: 74000 lambda_1: -0.8154, lambda_2: 409.1535 lambda_3: 0.0000 train remain: [1. 
0.99 0.99 0.87 0.88 0.83 0.87 0.88 0.83 0.52] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.82, 0.86, 0.88, 0.82, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.61, 0.52, 0.46, 0.38, 0.2] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111010110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 10111111111111111111111111111111111111111100000000 00011111111111111111111110100101010000000000000000 loss: 0.166661, lagrangian_loss: 0.001064, attention_score_distillation_loss: 0.007547 loss: 0.139480, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.007498 ---------------------------------------------------------------------- time: 2023-07-19 20:55:16 Evaluating: accuracy: 0.8269, eval_loss: 0.5582, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.314, expected_sparsity: 0.3053, expected_sequence_sparsity: 0.7916, target_sparsity: 0.3096, step: 76000 lambda_1: -0.8575, lambda_2: 420.7556 lambda_3: 0.0000 train remain: [1. 0.99 0.99 0.87 0.87 0.82 0.86 0.88 0.83 0.52] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.82, 0.86, 0.88, 0.82, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.61, 0.52, 0.46, 0.38, 0.2] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111010110000000 11111111111111111111111111111111111111111100100000 11111111111111111111111111111111111111111110100000 10111111111111111111111111111111111111111100000000 00011111111111111111111110100101100000000000000000 loss: 0.140443, lagrangian_loss: 0.002676, attention_score_distillation_loss: 0.007287 loss: 0.119519, lagrangian_loss: 0.000315, attention_score_distillation_loss: 0.007339 ---------------------------------------------------------------------- time: 2023-07-19 21:05:41 Evaluating: accuracy: 0.8258, eval_loss: 0.6066, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3205, expected_sparsity: 0.3129, expected_sequence_sparsity: 0.7939, target_sparsity: 0.3178, step: 78000 lambda_1: -1.4418, lambda_2: 432.0468 lambda_3: 0.0000 train remain: [1. 
0.99 0.99 0.87 0.87 0.79 0.85 0.88 0.83 0.52] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.8, 0.84, 0.88, 0.82, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.59, 0.5, 0.44, 0.36, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111010100000000 11111111111111111111111111111111111111111100000000 11111111111111111111111111111111111111111110100000 10111111111111111111111111111111111111111100000000 00011111111111111111111110100101000000100000000000 loss: 0.083602, lagrangian_loss: 0.000983, attention_score_distillation_loss: 0.007096 loss: 0.133840, lagrangian_loss: -0.000515, attention_score_distillation_loss: 0.006746 ---------------------------------------------------------------------- time: 2023-07-19 21:15:57 Evaluating: accuracy: 0.8305, eval_loss: 0.5435, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.3283, expected_sparsity: 0.3195, expected_sequence_sparsity: 0.7959, target_sparsity: 0.3259, step: 80000 lambda_1: -1.5426, lambda_2: 444.3086 lambda_3: 0.0000 train remain: [1. 0.98 0.99 0.86 0.86 0.79 0.85 0.88 0.81 0.51] infer remain: [1.0, 1.0, 1.0, 0.86, 0.86, 0.78, 0.84, 0.88, 0.8, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.58, 0.48, 0.43, 0.34, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111111101000000 11111111111111111111111111111111111111010000000000 11111111111111111111111111111111111111111100000000 11111111111111111111111111111111111111111110100000 10111111111111111111111110111111111111111100000000 00011111111111111111111110100101000000000000000000 loss: 0.146981, lagrangian_loss: -0.000195, attention_score_distillation_loss: 0.006710 loss: 0.242005, lagrangian_loss: -0.002165, attention_score_distillation_loss: 0.006698 ---------------------------------------------------------------------- time: 2023-07-19 21:26:24 Evaluating: accuracy: 0.8201, eval_loss: 0.5788, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3472, expected_sequence_sparsity: 0.8043, target_sparsity: 0.3341, step: 82000 lambda_1: -1.5755, lambda_2: 456.4297 lambda_3: 0.0000 train remain: [1. 
----------------------------------------------------------------------
time: 2023-07-19 21:26:24
Evaluating: accuracy: 0.8201, eval_loss: 0.5788, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3472, expected_sequence_sparsity: 0.8043, target_sparsity: 0.3341, step: 82000
lambda_1: -1.5755, lambda_2: 456.4297
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.86 0.86 0.79 0.84 0.86 0.81 0.5 ]
infer remain: [1.0, 0.96, 0.98, 0.86, 0.86, 0.78, 0.84, 0.86, 0.8, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.81, 0.7, 0.54, 0.46, 0.39, 0.31, 0.16]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111101000000
11111111111111111111111111111111111111111101000000
11111111111111111111111111111111111111010000000000
11111111111111111111111111111111111111111100000000
11111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111111100000000
00011111111111111111111110100101000000000000000000
loss: 0.104465, lagrangian_loss: -0.000840, attention_score_distillation_loss: 0.006489
loss: 0.055115, lagrangian_loss: 0.003851, attention_score_distillation_loss: 0.006220
----------------------------------------------------------------------
time: 2023-07-19 21:36:45
Evaluating: accuracy: 0.8262, eval_loss: 0.5802, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3661, expected_sparsity: 0.3521, expected_sequence_sparsity: 0.8058, target_sparsity: 0.3422, step: 84000
lambda_1: -1.6194, lambda_2: 467.4469
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.86 0.85 0.78 0.84 0.86 0.81 0.5 ]
infer remain: [1.0, 0.96, 0.98, 0.86, 0.84, 0.78, 0.84, 0.86, 0.8, 0.5]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.81, 0.68, 0.53, 0.45, 0.38, 0.31, 0.15]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111101000000
11111111111111111111111111111111111111110101000000
11111111111111111111111111111111111111010000000000
11111111111111111111111111111111111111111100000000
11111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111111100000000
00011111111111111111011110100101000000000000000001
loss: 0.099058, lagrangian_loss: -0.000209, attention_score_distillation_loss: 0.006140
loss: 0.254316, lagrangian_loss: -0.000789, attention_score_distillation_loss: 0.005948
ETA: 1 day, 10:29:36 | Epoch 6 finished. Took 3809.06 seconds.
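Each 50-character row printed with an evaluation record is a keep/drop bitmap over the 50 token bins of one pruned layer, and the "infer remain" entries are simply the fraction of 1s in the corresponding row. Verifiable directly with two rows from the step-82000 record above (layer ordering assumed to follow the prune locations):

rows = [
    "10111111111111111111111111111111111111111111111110",  # 2nd pruned layer
    "00011111111111111111111110100101000000000000000000",  # last pruned layer
]
for row in rows:
    assert len(row) == 50
    print(row.count("1") / len(row))
# 0.96 and 0.5 -- the 2nd and last entries of "infer remain" at step 82000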
----------------------------------------------------------------------
time: 2023-07-19 21:47:07
Evaluating: accuracy: 0.8256, eval_loss: 0.5762, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3747, expected_sparsity: 0.3642, expected_sequence_sparsity: 0.8095, target_sparsity: 0.3504, step: 86000
lambda_1: -1.4198, lambda_2: 478.4804
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.85 0.83 0.78 0.84 0.86 0.81 0.48]
infer remain: [1.0, 0.96, 0.98, 0.84, 0.82, 0.78, 0.84, 0.86, 0.8, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.65, 0.51, 0.42, 0.37, 0.29, 0.14]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111001000000
11111111111111111111111111111111111111100101000000
11111111111111111111111111111111111111010000000000
11111111111111111111111111111111111111111100000000
11111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111111100000000
00011111111111111111011110100101000000000000000000
loss: 0.081235, lagrangian_loss: -0.000364, attention_score_distillation_loss: 0.005828
loss: 0.321234, lagrangian_loss: 0.001089, attention_score_distillation_loss: 0.005584
----------------------------------------------------------------------
time: 2023-07-19 21:57:27
Evaluating: accuracy: 0.8213, eval_loss: 0.6036, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3747, expected_sparsity: 0.3642, expected_sequence_sparsity: 0.8095, target_sparsity: 0.3585, step: 88000
lambda_1: -0.8033, lambda_2: 489.6741
lambda_3: 0.0000
train remain: [1. 0.98 0.99 0.83 0.82 0.77 0.83 0.86 0.8 0.48]
infer remain: [1.0, 0.96, 0.98, 0.84, 0.82, 0.78, 0.84, 0.86, 0.8, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.79, 0.65, 0.51, 0.42, 0.37, 0.29, 0.14]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111001000000
11111111111111111111111111111111111111100101000000
11111111111111111111111111111111111111010000000000
11111111111111111111111111111111111111111100000000
11111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111111100000000
00011111111111111111011110100100000000000000100000
loss: 0.108728, lagrangian_loss: -0.000231, attention_score_distillation_loss: 0.005579
loss: 0.149286, lagrangian_loss: 0.001564, attention_score_distillation_loss: 0.005144
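target_sparsity keeps climbing linearly here (0.3504 at step 86000, 0.3585 at 88000, i.e. about 0.0081 per 2000 steps) until it later saturates at 0.5. That is what a linear warmup of the target would produce; a sketch under that assumption, with the warmup length in steps inferred from the logged values (roughly 10 epochs of MNLI at batch size 32) rather than read from the code:

def target_sparsity_at(step: int,
                       final_target: float = 0.5,
                       warmup_steps: int = 122_720) -> float:
    # Hypothetical linear ramp; 122,720 ~= 10 epochs at ~12,272 steps each.
    return min(final_target, final_target * step / warmup_steps)

print(round(target_sparsity_at(88_000), 4))   # 0.3585, as logged above
print(target_sparsity_at(130_000))            # 0.5 (saturated)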
----------------------------------------------------------------------
time: 2023-07-19 22:07:52
Evaluating: accuracy: 0.8155, eval_loss: 0.6269, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.3931, expected_sparsity: 0.3765, expected_sequence_sparsity: 0.8132, target_sparsity: 0.3667, step: 90000
lambda_1: -1.2292, lambda_2: 500.7760
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.83 0.81 0.76 0.83 0.86 0.8 0.48]
infer remain: [1.0, 0.96, 0.98, 0.82, 0.82, 0.76, 0.82, 0.86, 0.8, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.77, 0.63, 0.48, 0.39, 0.34, 0.27, 0.13]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111000000000
11111111111111111111111111111111111111100101000000
11111111111111111111111111111111111011010000000000
11111111111111111111111110111111111111111100000000
11111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111111100000000
00011111111111111111011110100100000000000000000001
loss: 0.066511, lagrangian_loss: -0.000130, attention_score_distillation_loss: 0.005258
loss: 0.253892, lagrangian_loss: 0.001876, attention_score_distillation_loss: 0.004921
----------------------------------------------------------------------
time: 2023-07-19 22:18:08
Evaluating: accuracy: 0.8149, eval_loss: 0.6355, token_prune_loc: [False, True, False, True, True, True, True, True, True, True], macs_sparsity: 0.391, expected_sparsity: 0.377, expected_sequence_sparsity: 0.8134, target_sparsity: 0.3748, step: 92000
lambda_1: -1.3120, lambda_2: 512.1489
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.82 0.8 0.76 0.83 0.84 0.79 0.48]
infer remain: [1.0, 0.96, 1.0, 0.82, 0.8, 0.76, 0.82, 0.84, 0.78, 0.48]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.96, 0.79, 0.63, 0.48, 0.39, 0.33, 0.26, 0.12]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111000000000
11111111111111111111111110111111111111100101000000
11111111111111111111111111111111111011010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111110100000000
00111111111111111111011110100100000000000000000000
loss: 0.127113, lagrangian_loss: 0.000595, attention_score_distillation_loss: 0.004781
loss: 0.066894, lagrangian_loss: -0.001545, attention_score_distillation_loss: 0.004743
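Since every evaluation record follows the same one-line "Evaluating: ..." format, the accuracy/sparsity trajectory can be scraped straight out of this log. A small parser (regex written against the lines above; the log file name is hypothetical):

import re

EVAL = re.compile(
    r"Evaluating: accuracy: (?P<acc>[\d.]+), eval_loss: [\d.]+, "
    r"token_prune_loc: \[[^\]]*\], macs_sparsity: (?P<macs>[\d.]+), "
    r".*?step: (?P<step>\d+)"
)

records = []
with open("mnli_pruning.log") as fh:   # hypothetical file name
    for line in fh:
        m = EVAL.search(line)
        if m:
            records.append((int(m["step"]), float(m["acc"]), float(m["macs"])))
# e.g. (90000, 0.8155, 0.3931), (92000, 0.8149, 0.391), ...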
----------------------------------------------------------------------
time: 2023-07-19 22:28:27
Evaluating: accuracy: 0.8154, eval_loss: 0.6168, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4102, expected_sparsity: 0.3978, expected_sequence_sparsity: 0.8196, target_sparsity: 0.383, step: 94000
lambda_1: -1.3148, lambda_2: 524.2593
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.81 0.79 0.75 0.83 0.85 0.78 0.47]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.78, 0.74, 0.82, 0.84, 0.78, 0.46]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.59, 0.43, 0.36, 0.3, 0.23, 0.11]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111101111100101000000
11111111111111111111111111111111011011010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111110100000000
00011111111111111111011110100100000000000000000000
loss: 0.146433, lagrangian_loss: 0.003324, attention_score_distillation_loss: 0.004428
loss: 0.057346, lagrangian_loss: 0.002999, attention_score_distillation_loss: 0.004355
----------------------------------------------------------------------
time: 2023-07-19 22:38:42
Evaluating: accuracy: 0.8163, eval_loss: 0.6108, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.418, expected_sparsity: 0.4021, expected_sequence_sparsity: 0.821, target_sparsity: 0.3911, step: 96000
lambda_1: -0.7558, lambda_2: 535.5180
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.81 0.76 0.75 0.83 0.85 0.78 0.47]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.76, 0.74, 0.82, 0.84, 0.78, 0.46]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.57, 0.42, 0.35, 0.29, 0.23, 0.1]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111111011100100000000
11111111111111111111111111111111011011010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111110100000000
00011111110111111111011110100100000000001000000000
loss: 0.144043, lagrangian_loss: 0.000869, attention_score_distillation_loss: 0.004202
loss: 0.128244, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.004056
----------------------------------------------------------------------
time: 2023-07-19 22:49:04
Evaluating: accuracy: 0.8204, eval_loss: 0.6013, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4223, expected_sparsity: 0.4064, expected_sequence_sparsity: 0.8223, target_sparsity: 0.3993, step: 98000
lambda_1: -2.2927, lambda_2: 547.2459
lambda_3: 0.0000
train remain: [1. 0.97 0.99 0.81 0.74 0.75 0.83 0.85 0.77 0.46]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.74, 0.74, 0.82, 0.84, 0.78, 0.46]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.56, 0.41, 0.34, 0.28, 0.22, 0.1]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111111011100000000000
11111111111111111111111111111111011011010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
10111111111111111111111110111111111111110100000000
00011111110111111111011110100100000001000000000000
loss: 0.224389, lagrangian_loss: -0.000301, attention_score_distillation_loss: 0.003897
ETA: 1 day, 9:32:33 | Epoch 7 finished. Took 3847.81 seconds.
loss: 0.135725, lagrangian_loss: 0.000305, attention_score_distillation_loss: 0.003770
----------------------------------------------------------------------
time: 2023-07-19 22:59:22
Evaluating: accuracy: 0.8194, eval_loss: 0.6157, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4223, expected_sparsity: 0.4072, expected_sequence_sparsity: 0.8225, target_sparsity: 0.4074, step: 100000
lambda_1: -1.9534, lambda_2: 558.5091
lambda_3: 0.0000
train remain: [0.99 0.97 0.98 0.81 0.73 0.73 0.83 0.85 0.75 0.46]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.74, 0.74, 0.82, 0.84, 0.76, 0.46]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.56, 0.41, 0.34, 0.28, 0.22, 0.1]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111111011100000000000
11111111111111111111111111111111011011010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
10111111111111111111111110111111111110110100000000
00011111110111111111011110100100000001000000000000
loss: 0.044169, lagrangian_loss: -0.001094, attention_score_distillation_loss: 0.003605
loss: 0.087274, lagrangian_loss: 0.002346, attention_score_distillation_loss: 0.003394
----------------------------------------------------------------------
time: 2023-07-19 23:09:44
Evaluating: accuracy: 0.8056, eval_loss: 0.6344, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4329, expected_sparsity: 0.42, expected_sequence_sparsity: 0.8264, target_sparsity: 0.4156, step: 102000
lambda_1: -3.9935, lambda_2: 570.8694
lambda_3: 0.0000
train remain: [0.99 0.96 0.98 0.8 0.71 0.73 0.82 0.84 0.75 0.44]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.7, 0.72, 0.82, 0.84, 0.74, 0.44]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.53, 0.38, 0.31, 0.26, 0.19, 0.09]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111101010100000000000
11111111111111111111111111111111011010010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110111111111110110100000000
00011111110111111011011110100100000000000000000001
loss: 0.137635, lagrangian_loss: 0.001601, attention_score_distillation_loss: 0.003222
loss: 0.083981, lagrangian_loss: 0.002092, attention_score_distillation_loss: 0.003095
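The contrast between the fractional "train remain" values and the 0.02-quantized "infer remain" rows is what stochastic relaxations of binary gates produce: during training each of the 50 bins carries a soft gate whose expected value is logged, while evaluation binarizes the gates. A minimal sketch of the standard stretched hard-concrete gate (Louizos et al., 2018) that this style of token pruning typically uses; the exact parameterization here is an assumption, with log_alpha playing the role of the token_loga tensor:

import torch

def hard_concrete_gate(log_alpha: torch.Tensor,
                       temperature: float = 2 / 3,
                       limit_l: float = -0.1, limit_r: float = 1.1,
                       training: bool = True) -> torch.Tensor:
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / temperature)
    else:
        s = torch.sigmoid(log_alpha)        # deterministic at eval time
    # Stretch to (limit_l, limit_r), then clip into [0, 1]: exact 0s and 1s
    # become reachable, which is where the bitmaps above come from.
    return (s * (limit_r - limit_l) + limit_l).clamp(0.0, 1.0)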
----------------------------------------------------------------------
time: 2023-07-19 23:20:01
Evaluating: accuracy: 0.8115, eval_loss: 0.639, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4329, expected_sparsity: 0.42, expected_sequence_sparsity: 0.8264, target_sparsity: 0.4237, step: 104000
lambda_1: -2.2356, lambda_2: 582.7551
lambda_3: 0.0000
train remain: [0.99 0.96 0.99 0.8 0.69 0.72 0.82 0.85 0.74 0.44]
infer remain: [1.0, 0.96, 0.98, 0.8, 0.7, 0.72, 0.82, 0.84, 0.74, 0.44]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.75, 0.53, 0.38, 0.31, 0.26, 0.19, 0.09]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111110000000000
11111111111111111111111110111111101010100000000000
11111111111111111111111111111111011010010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110111111111110110100000000
00011111110111111011011110101100000000000000000000
loss: 0.236071, lagrangian_loss: 0.002409, attention_score_distillation_loss: 0.002962
loss: 0.104091, lagrangian_loss: -0.001018, attention_score_distillation_loss: 0.002784
----------------------------------------------------------------------
time: 2023-07-19 23:30:20
Evaluating: accuracy: 0.7981, eval_loss: 0.7098, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4372, expected_sparsity: 0.4257, expected_sequence_sparsity: 0.8281, target_sparsity: 0.4319, step: 106000
lambda_1: -2.0419, lambda_2: 594.3987
lambda_3: 0.0000
train remain: [0.99 0.97 0.98 0.78 0.69 0.71 0.82 0.85 0.74 0.43]
infer remain: [1.0, 0.96, 0.98, 0.78, 0.7, 0.72, 0.82, 0.84, 0.74, 0.42]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.73, 0.51, 0.37, 0.3, 0.25, 0.19, 0.08]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111010000000000
11111111111111111111111110111110101010100000001000
11111111111111111111111111111111011010010000000000
11111111111111111111111110111111111111111100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110111111111110110100000000
00011111110111111011011110100100000000000000000000
loss: 0.234830, lagrangian_loss: -0.000503, attention_score_distillation_loss: 0.002689
loss: 0.304385, lagrangian_loss: -0.002101, attention_score_distillation_loss: 0.002484
----------------------------------------------------------------------
time: 2023-07-19 23:40:36
Evaluating: accuracy: 0.7986, eval_loss: 0.6651, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4621, expected_sparsity: 0.4483, expected_sequence_sparsity: 0.8349, target_sparsity: 0.44, step: 108000
lambda_1: -3.1086, lambda_2: 605.8508
lambda_3: 0.0000
train remain: [0.99 0.96 0.98 0.77 0.69 0.71 0.8 0.84 0.74 0.37]
infer remain: [0.98, 0.96, 0.98, 0.76, 0.68, 0.7, 0.8, 0.84, 0.74, 0.38]
layerwise remain: [1.0, 1.0, 0.98, 0.94, 0.92, 0.7, 0.48, 0.33, 0.27, 0.22, 0.17, 0.06]
11111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111101111010000000000
11111111111111111111111110111110101010100000000000
11111111111111111111111110111111011010010000000000
11111111111111111111111110111111111111110100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110111111111110110100000000
00011111110111111011001110000100000000000000000000
loss: 0.299455, lagrangian_loss: 0.004507, attention_score_distillation_loss: 0.002311
loss: 0.106518, lagrangian_loss: 0.000609, attention_score_distillation_loss: 0.002116
----------------------------------------------------------------------
time: 2023-07-19 23:50:51
Evaluating: accuracy: 0.8086, eval_loss: 0.6279, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4614, expected_sparsity: 0.4462, expected_sequence_sparsity: 0.8343, target_sparsity: 0.4482, step: 110000
lambda_1: -1.6852, lambda_2: 617.2943
lambda_3: 0.0000
train remain: [0.99 0.96 0.98 0.75 0.67 0.7 0.79 0.84 0.73 0.35]
infer remain: [1.0, 0.96, 0.98, 0.74, 0.68, 0.7, 0.8, 0.84, 0.72, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.7, 0.47, 0.33, 0.27, 0.22, 0.16, 0.05]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111111101111000000000000
11111111111111111111111110111110101010100000000000
11111111111111111111111110111111011010010000000000
11111111111111111111111110111111111111110100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110101111111110110100000000
00001111110111111011001010000100000000000000000000
loss: 0.209008, lagrangian_loss: -0.001150, attention_score_distillation_loss: 0.002020
ETA: 1 day, 8:30:12 | Epoch 8 finished. Took 3783.0 seconds.
loss: 0.093049, lagrangian_loss: -0.002166, attention_score_distillation_loss: 0.001883
----------------------------------------------------------------------
time: 2023-07-20 00:01:10
Evaluating: accuracy: 0.8102, eval_loss: 0.6184, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.472, expected_sparsity: 0.4561, expected_sequence_sparsity: 0.8373, target_sparsity: 0.4563, step: 112000
lambda_1: -1.6085, lambda_2: 628.7651
lambda_3: 0.0000
train remain: [0.99 0.97 0.98 0.73 0.67 0.7 0.77 0.84 0.71 0.33]
infer remain: [1.0, 0.96, 0.98, 0.72, 0.66, 0.7, 0.78, 0.84, 0.72, 0.34]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.68, 0.45, 0.31, 0.24, 0.21, 0.15, 0.05]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101101111000000000000
11111111111111111111111110111110101010000000000000
11111111111111111111111110111111011010010000000000
10111111111111111111111110111111111111110100000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110101111111110110100000000
00001111110111111011001010000100000000000000000000
loss: 0.109491, lagrangian_loss: 0.005070, attention_score_distillation_loss: 0.001665
loss: 0.213326, lagrangian_loss: -0.001050, attention_score_distillation_loss: 0.001551
----------------------------------------------------------------------
time: 2023-07-20 00:11:29
Evaluating: accuracy: 0.7944, eval_loss: 0.6889, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.4913, expected_sparsity: 0.4742, expected_sequence_sparsity: 0.8428, target_sparsity: 0.4645, step: 114000
lambda_1: -1.4105, lambda_2: 640.2189
lambda_3: 0.0000
train remain: [0.98 0.96 0.99 0.71 0.66 0.7 0.77 0.84 0.71 0.33]
infer remain: [0.96, 0.96, 0.98, 0.72, 0.66, 0.7, 0.76, 0.84, 0.7, 0.34]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.65, 0.43, 0.3, 0.23, 0.19, 0.13, 0.05]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101101111000000000000
11111111111111111111111110111110101010000000000000
11111111111111111111111110111111011010010000000000
10111111111111111111111110111111111111110000000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110101111111110110000000000
10001111110111101011001010000000000001000000000000
loss: 0.152163, lagrangian_loss: 0.000406, attention_score_distillation_loss: 0.001389
loss: 0.072335, lagrangian_loss: 0.002114, attention_score_distillation_loss: 0.001224
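The ETA lines are consistent with simply scaling the latest epoch duration by the number of the 40 training epochs still to go; a sketch of that estimate (the trainer may well average over all past epochs instead, hence the small mismatch):

import datetime

def eta_after_epoch(finished_epoch: int, epoch_seconds: float,
                    total_epochs: int = 40) -> str:
    remaining = (total_epochs - finished_epoch - 1) * epoch_seconds
    return str(datetime.timedelta(seconds=round(remaining)))

print(eta_after_epoch(8, 3783.0))
# '1 day, 8:34:33' -- close to the logged 'ETA: 1 day, 8:30:12'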
----------------------------------------------------------------------
time: 2023-07-20 00:21:47
Evaluating: accuracy: 0.7989, eval_loss: 0.676, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.502, expected_sparsity: 0.4839, expected_sequence_sparsity: 0.8457, target_sparsity: 0.4726, step: 116000
lambda_1: -1.6965, lambda_2: 651.9952
lambda_3: 0.0000
train remain: [0.98 0.96 0.98 0.7 0.65 0.69 0.77 0.83 0.69 0.33]
infer remain: [0.96, 0.96, 0.98, 0.7, 0.64, 0.68, 0.76, 0.84, 0.7, 0.34]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.63, 0.4, 0.28, 0.21, 0.18, 0.12, 0.04]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101101110000000000000
11111111111111111111111110111110001010000000000000
11111111111111111111111110110111011010010000000000
10111111111111111111111110111111111111110000000000
10111111111111111111111110111111111111111110100000
00111111111111111111111110101111011110110100000000
00001111110111111011001010000000000000000001000000
loss: 0.203861, lagrangian_loss: -0.000070, attention_score_distillation_loss: 0.001082
loss: 0.206913, lagrangian_loss: -0.001946, attention_score_distillation_loss: 0.000912
----------------------------------------------------------------------
time: 2023-07-20 00:32:04
Evaluating: accuracy: 0.7982, eval_loss: 0.6382, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5041, expected_sparsity: 0.4892, expected_sequence_sparsity: 0.8473, target_sparsity: 0.4808, step: 118000
lambda_1: -2.4659, lambda_2: 662.7611
lambda_3: 0.0000
train remain: [0.97 0.96 0.98 0.69 0.64 0.68 0.76 0.83 0.69 0.33]
infer remain: [0.96, 0.96, 0.98, 0.68, 0.64, 0.68, 0.76, 0.82, 0.7, 0.32]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.61, 0.39, 0.27, 0.2, 0.17, 0.12, 0.04]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101101010000000000000
11111111111111111111111110111110001010000000000000
11111111111111111111111110110111011010010000000000
10111111111111111111111110111111111111110000000000
10111111111111111111111110111111111111111100100000
00111111111111111111111110101111011110110000000100
00001111110111101011001010000000000000000000000001
loss: 0.400744, lagrangian_loss: 0.000024, attention_score_distillation_loss: 0.000751
loss: 0.071248, lagrangian_loss: 0.000745, attention_score_distillation_loss: 0.000589
----------------------------------------------------------------------
time: 2023-07-20 00:42:22
Evaluating: accuracy: 0.8009, eval_loss: 0.6733, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5062, expected_sparsity: 0.4926, expected_sequence_sparsity: 0.8484, target_sparsity: 0.4889, step: 120000
lambda_1: -3.3831, lambda_2: 674.8969
lambda_3: 0.0000
train remain: [0.97 0.96 0.98 0.68 0.62 0.67 0.76 0.82 0.69 0.33]
infer remain: [0.96, 0.96, 0.98, 0.68, 0.62, 0.68, 0.76, 0.82, 0.68, 0.32]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.61, 0.38, 0.26, 0.2, 0.16, 0.11, 0.04]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101101010000000000000
11111111111111111111111110110110001010000000000000
11111111111111111111111110110111011010010000000000
10111111111111111111111110111111111111110000000000
10111111111111111111111110111111111111111100100000
00111111111111111111111110101111011110110000000000
10001111110111101011001010000000000000000000000000
loss: 0.211470, lagrangian_loss: 0.013179, attention_score_distillation_loss: 0.000420
loss: 0.210821, lagrangian_loss: 0.027574, attention_score_distillation_loss: 0.000262
----------------------------------------------------------------------
time: 2023-07-20 00:52:41
Evaluating: accuracy: 0.7965, eval_loss: 0.6614, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5126, expected_sparsity: 0.4981, expected_sequence_sparsity: 0.85, target_sparsity: 0.4971, step: 122000
lambda_1: -5.1370, lambda_2: 686.0227
lambda_3: 0.0000
train remain: [0.96 0.96 0.98 0.66 0.61 0.67 0.75 0.8 0.66 0.3 ]
infer remain: [0.96, 0.96, 0.98, 0.66, 0.62, 0.68, 0.76, 0.8, 0.66, 0.3]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.6, 0.37, 0.25, 0.19, 0.15, 0.1, 0.03]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101101010000000000000
11111111111111111111111110110110010010000000000000
11111111111111111111111110110111011010000000000100
10111111111111111111111110111111111111110000000000
10111111110111111111111110111111111111111100100000
00111111111111111111111010101111011110110000000000
00001111110111101011001010000000000000000000000000
loss: 0.147771, lagrangian_loss: -0.008677, attention_score_distillation_loss: 0.000197
ETA: 1 day, 7:28:03 | Epoch 9 finished. Took 3789.92 seconds.
loss: 0.080712, lagrangian_loss: 0.007763, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 01:03:02
Evaluating: accuracy: 0.7996, eval_loss: 0.6635, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5126, expected_sparsity: 0.4995, expected_sequence_sparsity: 0.8505, target_sparsity: 0.5, step: 124000
lambda_1: -2.3202, lambda_2: 697.5302
lambda_3: 0.0000
train remain: [0.96 0.96 0.98 0.65 0.61 0.67 0.75 0.79 0.64 0.3 ]
infer remain: [0.96, 0.96, 0.98, 0.66, 0.62, 0.68, 0.74, 0.8, 0.64, 0.3]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.6, 0.37, 0.25, 0.19, 0.15, 0.1, 0.03]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001010100000000000
11111111111111111111111110110110000010000000100000
11111111111111111111111110110111011110000000000000
10111111111111111111111110111111111011110000000000
10111111110111111111111110111111111111111100100000
00111111111111111111011010101111011110110000000000
10001111110111101010001010000000000000000000000000
loss: 0.108752, lagrangian_loss: -0.001888, attention_score_distillation_loss: 0.000193
loss: 0.209096, lagrangian_loss: -0.001531, attention_score_distillation_loss: 0.000194
Starting saving the best from epoch 10 and step 126000
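"Starting saving the best" marks the point where checkpoint tracking switches on: only after the sparsity target has fully warmed up (epoch 10 here) do evaluations start competing for "best model". A plausible reconstruction of that bookkeeping (names and structure hypothetical):

best = {"score": float("-inf"), "step": None}

def maybe_save_best(epoch, step, score, macs_sparsity, eval_loss,
                    start_epoch: int = 10):
    # Hypothetical: before the warmup ends, checkpoints are not compared.
    if epoch < start_epoch:
        return
    if score > best["score"]:
        best.update(score=score, step=step)
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} "
              f"| MACs sparsity: {macs_sparsity} | Score: {score} "
              f"| Loss: {eval_loss}]")
        # model.save_pretrained(...)  # output path omitted
    else:
        print(f"Best eval score so far: {best['score']:.4f} @ step {best['step']}")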
----------------------------------------------------------------------
time: 2023-07-20 01:13:28
Evaluating: accuracy: 0.7971, eval_loss: 0.6619, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5183, expected_sparsity: 0.503, expected_sequence_sparsity: 0.8515, target_sparsity: 0.5, step: 126000
lambda_1: -1.3344, lambda_2: 709.7443
lambda_3: 0.0000
train remain: [0.97 0.96 0.98 0.65 0.61 0.67 0.75 0.77 0.62 0.28]
infer remain: [0.96, 0.96, 0.98, 0.66, 0.62, 0.66, 0.74, 0.76, 0.62, 0.28]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.6, 0.37, 0.24, 0.18, 0.14, 0.09, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111001010000000000000
11111111111111111111111110110110000010000000100000
11111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10111111110111111111111110110111011111111100100000
00111111110111111111011010101111011110110000000000
10001111010111101010001010000000000000000000000000
Saving the best model so far: [Epoch 10 | Step: 126000 | MACs sparsity: 0.5183 | Score: 0.7971 | Loss: 0.6619]
loss: 0.305361, lagrangian_loss: -0.000501, attention_score_distillation_loss: 0.000195
loss: 0.301266, lagrangian_loss: 0.001701, attention_score_distillation_loss: 0.000196
----------------------------------------------------------------------
time: 2023-07-20 01:24:02
Evaluating: accuracy: 0.7798, eval_loss: 0.7082, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5225, expected_sparsity: 0.507, expected_sequence_sparsity: 0.8527, target_sparsity: 0.5, step: 128000
lambda_1: -1.4333, lambda_2: 720.3761
lambda_3: 0.0000
train remain: [0.97 0.97 0.98 0.64 0.61 0.67 0.75 0.76 0.62 0.28]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.62, 0.66, 0.74, 0.76, 0.62, 0.28]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.36, 0.24, 0.18, 0.13, 0.08, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001010000000000000
11111111111111111111111110110111000010000000000000
11111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10111111110111111111111110111111011111110100100000
00111111110111111111011010001111011110110000100000
00001111010110101011011010000000000000000000000000
Best eval score so far: 0.7971 @ step 126000 epoch 10.27
loss: 0.154010, lagrangian_loss: -0.000334, attention_score_distillation_loss: 0.000195
loss: 0.306660, lagrangian_loss: 0.001129, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 01:34:18
Evaluating: accuracy: 0.7906, eval_loss: 0.7126, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5204, expected_sparsity: 0.5054, expected_sequence_sparsity: 0.8522, target_sparsity: 0.5, step: 130000
lambda_1: -0.8938, lambda_2: 731.5958
lambda_3: 0.0000
train remain: [0.97 0.97 0.98 0.64 0.61 0.67 0.75 0.75 0.62 0.28]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.62, 0.68, 0.74, 0.76, 0.62, 0.28]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.36, 0.24, 0.18, 0.14, 0.08, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001010000000000000
11111111111111111111111110110110000010000000001000
11111111111111111111111110110111011010100000000000
10111111111111111111111110111111111011110000000000
10111111110111111111111110111111011111110100100000
00111111110111111111011010011111011110110000000000
00001111010110101010001010000000000000010100000000
Best eval score so far: 0.7971 @ step 126000 epoch 10.27
loss: 0.237739, lagrangian_loss: -0.000128, attention_score_distillation_loss: 0.000197
loss: 0.114958, lagrangian_loss: -0.000475, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 01:44:36
Evaluating: accuracy: 0.7955, eval_loss: 0.6718, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5225, expected_sparsity: 0.5059, expected_sequence_sparsity: 0.8524, target_sparsity: 0.5, step: 132000
lambda_1: -1.3734, lambda_2: 742.8481
lambda_3: 0.0000
train remain: [0.97 0.97 0.98 0.64 0.61 0.67 0.75 0.74 0.62 0.27]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.62, 0.68, 0.74, 0.74, 0.62, 0.28]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.36, 0.24, 0.18, 0.13, 0.08, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000001000000
11111111111111111111111110110110000010000010000000
11111111111111111111111110110111011011000000000000
10111111111111111111111110111111111011110000000000
10111111110111111111111110110111011111110100100000
00111111110111111111011010011111011110110000000000
10001111010110101010011010000000000000000000000000
Best eval score so far: 0.7971 @ step 126000 epoch 10.27
loss: 0.176480, lagrangian_loss: -0.000337, attention_score_distillation_loss: 0.000193
loss: 0.174163, lagrangian_loss: 0.003387, attention_score_distillation_loss: 0.000191
----------------------------------------------------------------------
time: 2023-07-20 01:54:55
Evaluating: accuracy: 0.7971, eval_loss: 0.6585, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5246, expected_sparsity: 0.5106, expected_sequence_sparsity: 0.8538, target_sparsity: 0.5, step: 134000
lambda_1: -1.3660, lambda_2: 754.4758
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.67 0.75 0.73 0.61 0.27]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.74, 0.6, 0.26]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.23, 0.17, 0.13, 0.08, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000001000000000
11111111111111111111111110110110000010000000000000
11111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10011111110111111111111110110111011111111100100000
00111111110111111111011010001111011110110000000000
00001111010110101010001010000000010000000000000000
Best eval score so far: 0.7971 @ step 126000 epoch 10.27
loss: 0.292469, lagrangian_loss: 0.000349, attention_score_distillation_loss: 0.000196
ETA: 1 day, 6:26:36 | Epoch 10 finished. Took 3809.81 seconds.
loss: 0.372534, lagrangian_loss: 0.000836, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 02:05:21
Evaluating: accuracy: 0.8043, eval_loss: 0.6594, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5246, expected_sparsity: 0.5106, expected_sequence_sparsity: 0.8538, target_sparsity: 0.5, step: 136000
lambda_1: -1.0511, lambda_2: 765.5814
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.74 0.73 0.59 0.25]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.74, 0.6, 0.26]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.23, 0.17, 0.13, 0.08, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001001000000000000
11111111111111111111111110110110000010000000000000
11111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10011111110111111111111110110111011111111100100000
00111111110111111111011010001111011110110000000000
00001111010110101010001010000100000000000000000000
Best eval score so far: 0.7971 @ step 126000 epoch 10.27
Saving the best model so far: [Epoch 11 | Step: 136000 | MACs sparsity: 0.5246 | Score: 0.8043 | Loss: 0.6594]
loss: 0.162595, lagrangian_loss: 0.001356, attention_score_distillation_loss: 0.000191
loss: 0.195997, lagrangian_loss: -0.000147, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 02:15:59
Evaluating: accuracy: 0.8093, eval_loss: 0.6172, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4968, expected_sequence_sparsity: 0.8496, target_sparsity: 0.5, step: 138000
lambda_1: -1.4811, lambda_2: 776.6506
lambda_3: 0.0000
train remain: [0.98 0.97 0.98 0.64 0.6 0.65 0.73 0.73 0.57 0.24]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.74, 0.56, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101101000000000000000
11111111111111111111111110111100000010000000000000
10111111111111111111111110111111011010000000000000
10111111111111111111111110111111111011110000000000
10011111111111111111111110110111011111110100100000
00111111110111101011011010001111011110110000000000
00001111010010101010001010000000000000000100000000
Best eval score so far: 0.8043 @ step 136000 epoch 11.08
Saving the best model so far: [Epoch 11 | Step: 138000 | MACs sparsity: 0.5139 | Score: 0.8093 | Loss: 0.6172]
loss: 0.280431, lagrangian_loss: -0.000144, attention_score_distillation_loss: 0.000197
loss: 0.098799, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000194
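The relationship between "layerwise remain" and macs_sparsity can be approximated with a simple cost model: per-layer projections and the FFN scale linearly with the fraction of tokens kept, while the attention score/context matmuls scale quadratically. This rough estimator is an illustration only (the dimensions and the cost model are assumptions; it will not reproduce the logged numbers exactly, which depend on the real sequence-length distribution):

def layer_macs(r: float, n: int = 128, h: int = 768, f: int = 3072) -> float:
    # r = fraction of tokens entering this layer; n = full sequence length.
    kept = r * n
    linear = kept * (4 * h * h + 2 * h * f)   # QKV/output proj + FFN
    quadratic = 2 * kept * kept * h           # attention scores + context
    return linear + quadratic

def macs_sparsity(layerwise_remain) -> float:
    dense = len(layerwise_remain) * layer_macs(1.0)
    pruned = sum(layer_macs(r) for r in layerwise_remain)
    return 1.0 - pruned / dense

print(round(macs_sparsity(
    [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.23, 0.17, 0.13, 0.08, 0.02]), 3))
# ~0.47 with these toy dimensions, vs. 0.5246 in the log above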
----------------------------------------------------------------------
time: 2023-07-20 02:26:21
Evaluating: accuracy: 0.8016, eval_loss: 0.6703, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.516, expected_sparsity: 0.4984, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 140000
lambda_1: -1.1562, lambda_2: 787.7200
lambda_3: 0.0000
train remain: [0.98 0.97 0.98 0.64 0.6 0.65 0.73 0.73 0.56 0.23]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.64, 0.74, 0.74, 0.56, 0.24]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000000000100
11111111111111111111111110110100000010000010000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10011111110111111111111110111111011111110100100000
10011111110111101011011010001111011110110000000000
00001111010010101011000010000000000100000000000000
Best eval score so far: 0.8093 @ step 138000 epoch 11.25
loss: 0.144605, lagrangian_loss: -0.000362, attention_score_distillation_loss: 0.000197
loss: 0.245779, lagrangian_loss: 0.003554, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 02:36:43
Evaluating: accuracy: 0.8061, eval_loss: 0.6357, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5268, expected_sparsity: 0.5131, expected_sequence_sparsity: 0.8546, target_sparsity: 0.5, step: 142000
lambda_1: -1.3141, lambda_2: 798.9912
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.73 0.73 0.56 0.22]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.6, 0.64, 0.74, 0.74, 0.56, 0.22]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.22, 0.16, 0.12, 0.07, 0.01]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000000100000
11111111111111111111111110110100000010010000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10011111110111111111111110110111011111111100100000
00011111110111101011111010001111011110110000000000
00001111010010101010000010000000000000000010000000
Best eval score so far: 0.8093 @ step 138000 epoch 11.25
loss: 0.347282, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000197
loss: 0.243104, lagrangian_loss: 0.001134, attention_score_distillation_loss: 0.000198
----------------------------------------------------------------------
time: 2023-07-20 02:47:01
Evaluating: accuracy: 0.8097, eval_loss: 0.6199, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.516, expected_sparsity: 0.499, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 144000
lambda_1: -1.3071, lambda_2: 810.0630
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.73 0.73 0.56 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.56, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000000000100
11111111111111111111111110110100000010100000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101111111110000000000
10011111110111111111111110110111011111110100100000
00011111110111101011011010011111011110110000000000
00000111010010101010000010010000010000000000000000
Best eval score so far: 0.8093 @ step 138000 epoch 11.25
Saving the best model so far: [Epoch 11 | Step: 144000 | MACs sparsity: 0.516 | Score: 0.8097 | Loss: 0.6199]
loss: 0.188464, lagrangian_loss: 0.001622, attention_score_distillation_loss: 0.000192
loss: 0.181171, lagrangian_loss: -0.000170, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 02:57:39
Evaluating: accuracy: 0.809, eval_loss: 0.6157, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4975, expected_sequence_sparsity: 0.8498, target_sparsity: 0.5, step: 146000
lambda_1: -0.6733, lambda_2: 821.9365
lambda_3: 0.0000
train remain: [0.98 0.97 0.99 0.64 0.6 0.65 0.74 0.73 0.56 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.56, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000000001000
11111111111111111111111110110100000010000000000100
10111111111111111111111110110111011010000100000000
10111111111111111111111110111101111111110000000000
10011111110111111111111110110111011111110100100000
00011111110111101011011010001111011110111000000000
00000111110010101010000010000000000001000000000000
Best eval score so far: 0.8097 @ step 144000 epoch 11.73
loss: 0.385289, lagrangian_loss: 0.001562, attention_score_distillation_loss: 0.000190
loss: 0.314096, lagrangian_loss: 0.000223, attention_score_distillation_loss: 0.000193
ETA: 1 day, 5:25:57 | Epoch 11 finished. Took 3839.37 seconds.
----------------------------------------------------------------------
time: 2023-07-20 03:08:02
Evaluating: accuracy: 0.8057, eval_loss: 0.5928, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5268, expected_sparsity: 0.5121, expected_sequence_sparsity: 0.8542, target_sparsity: 0.5, step: 148000
lambda_1: -1.6870, lambda_2: 833.2999
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.66 0.74 0.73 0.55 0.22]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.56, 0.22]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.23, 0.17, 0.12, 0.07, 0.02]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001001000000000000
11111111111111111111111110110101000010000000000000
10111111111111111111111110110111011011000000000000
10111111111111111111111110111101111011110100000000
10011111110111111111111110110111011111110100100000
00011111110111101011011010001111011110110000100000
00000111010010101010000010010000000001000000000000
Best eval score so far: 0.8097 @ step 144000 epoch 11.73
loss: 0.160452, lagrangian_loss: -0.000845, attention_score_distillation_loss: 0.000192
loss: 0.278585, lagrangian_loss: 0.004359, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 03:18:26
Evaluating: accuracy: 0.809, eval_loss: 0.6141, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4978, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 150000
lambda_1: -0.8849, lambda_2: 844.8425
lambda_3: 0.0000
train remain: [0.98 0.98 0.99 0.64 0.6 0.66 0.74 0.73 0.54 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001100000000000000
11111111111111111111111110110100000010000000010000
11111111111111111111111110110111011010000000000000
10111111111111111111111110111101111011110100000000
10011111110111111111111110110111011111110100100000
00011111110111101011011010001111011110110000000000
10000111010010101010000010000000000000010000000000
Best eval score so far: 0.8097 @ step 144000 epoch 11.73
loss: 0.111988, lagrangian_loss: 0.000474, attention_score_distillation_loss: 0.000191
loss: 0.114699, lagrangian_loss: 0.003347, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 03:28:44
Evaluating: accuracy: 0.8159, eval_loss: 0.6239, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4978, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 152000
lambda_1: -1.2129, lambda_2: 855.6608
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.73 0.73 0.54 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111001000000000000000
11111111111111111111111110110100000010010000000000
10111111111111111111111110110111011011000000000000
10111111111111111111111110111101111011110000100000
10011111110111111111111110110111011111110100100000
00011111110111101011011010001101011110110000100000
10000111010010101010000010000001000000000000000000
Best eval score so far: 0.8097 @ step 144000 epoch 11.73
Saving the best model so far: [Epoch 12 | Step: 152000 | MACs sparsity: 0.5139 | Score: 0.8159 | Loss: 0.6239]
loss: 0.072976, lagrangian_loss: 0.001557, attention_score_distillation_loss: 0.000188
loss: 0.153279, lagrangian_loss: -0.000421, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 03:39:13
Evaluating: accuracy: 0.8148, eval_loss: 0.5968, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4978, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 154000
lambda_1: -0.9173, lambda_2: 866.8769
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.73 0.72 0.54 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.02]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000100000000
11111111111111111111111110110100000110000000000000
10111111111111111111111110110111011011000000000000
10111111111111111111111110111101111011110000010000
10011111110111111111111110110111011111110100100000
00011111110111101011011010101101011110110000000000
00000111110010101010000010000000000100000000000000
Best eval score so far: 0.8159 @ step 152000 epoch 12.39
loss: 0.188022, lagrangian_loss: 0.002231, attention_score_distillation_loss: 0.000195
loss: 0.191253, lagrangian_loss: 0.004047, attention_score_distillation_loss: 0.000190
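attention_score_distillation_loss has by now been annealed to around 2e-4 (compare ~0.019 early in the run), suggesting its weight or its target gap shrinks as pruning converges. One plausible form of such a term (an assumption, not the repository's actual definition): match the student's attention maps to a teacher's on the token positions that survive pruning:

import torch

def attention_score_distillation(student_attn: torch.Tensor,
                                 teacher_attn: torch.Tensor,
                                 keep_mask: torch.Tensor) -> torch.Tensor:
    # student_attn / teacher_attn: [batch, heads, seq, seq]
    # keep_mask: [batch, seq], 1.0 for tokens that are kept
    pair = keep_mask[:, None, None, :] * keep_mask[:, None, :, None]
    sq_err = (student_attn - teacher_attn).pow(2) * pair
    return sq_err.sum() / pair.sum().clamp(min=1.0)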
----------------------------------------------------------------------
time: 2023-07-20 03:49:30
Evaluating: accuracy: 0.8117, eval_loss: 0.618, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.516, expected_sparsity: 0.4993, expected_sequence_sparsity: 0.8504, target_sparsity: 0.5, step: 156000
lambda_1: -1.4911, lambda_2: 877.7793
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.65 0.73 0.71 0.54 0.22]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.22]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.07, 0.01]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001010000000000000
11111111111111111111111110110100000010000000100000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101111011110000010000
10011111110111111111111110110111011111110100100000
00011111110111101011011010011101011110110000000000
10000111010010101011000010000000000000000000000000
Best eval score so far: 0.8159 @ step 152000 epoch 12.39
loss: 0.147306, lagrangian_loss: 0.000563, attention_score_distillation_loss: 0.000193
loss: 0.400239, lagrangian_loss: -0.000335, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-20 03:59:51
Evaluating: accuracy: 0.8191, eval_loss: 0.599, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4979, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 158000
lambda_1: -0.7731, lambda_2: 888.7317
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.66 0.73 0.71 0.54 0.21]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.01]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000010000000
11111111111111111111111110111100000010000000000000
10111111111111111111111110110111111010000000000000
10111111111111111111111110111101111011110100000000
10011111110111111111111110110111011111111000100000
00011111110111101011011010001101011110110010000000
00000111010010101010001010000000000000000000000000
Best eval score so far: 0.8159 @ step 152000 epoch 12.39
Saving the best model so far: [Epoch 12 | Step: 158000 | MACs sparsity: 0.5139 | Score: 0.8191 | Loss: 0.599]
loss: 0.085934, lagrangian_loss: 0.000347, attention_score_distillation_loss: 0.000193
loss: 0.156551, lagrangian_loss: 0.002157, attention_score_distillation_loss: 0.000191
ETA: 1 day, 4:24:02 | Epoch 12 finished. Took 3817.63 seconds.
----------------------------------------------------------------------
time: 2023-07-20 04:10:21
Evaluating: accuracy: 0.8111, eval_loss: 0.6303, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5139, expected_sparsity: 0.4979, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 160000
lambda_1: -0.8783, lambda_2: 900.2420
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.66 0.73 0.71 0.54 0.2 ]
infer remain: [1.0, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.94, 0.6, 0.36, 0.24, 0.18, 0.13, 0.07, 0.01]
11111111111111111111111111111111111111111111111111
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111111001000000000000000
11111111111111111111111110110100000010000010000000
10111111111111111111111110110111011011000000000000
10111111111111111111111110111101111011110100000000
10011111110111111111111110110111011111111000100000
00011111110111101011011010001101011110110010000000
10000011010010101010000010000000000000000010000000
Best eval score so far: 0.8191 @ step 158000 epoch 12.87
loss: 0.084749, lagrangian_loss: -0.000121, attention_score_distillation_loss: 0.000193
loss: 0.091796, lagrangian_loss: -0.000019, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 04:20:42
Evaluating: accuracy: 0.8082, eval_loss: 0.6277, token_prune_loc: [True, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5268, expected_sparsity: 0.5125, expected_sequence_sparsity: 0.8544, target_sparsity: 0.5, step: 162000
lambda_1: -0.6602, lambda_2: 911.0939
lambda_3: 0.0000
train remain: [0.98 0.98 0.99 0.64 0.6 0.66 0.73 0.71 0.54 0.2 ]
infer remain: [0.96, 0.96, 0.98, 0.64, 0.6, 0.66, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 0.96, 0.92, 0.9, 0.58, 0.35, 0.23, 0.17, 0.12, 0.07, 0.01]
10111111111111111111111111111111111111111111111110
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000001000000
11111111111111111111111110110101000010000000000000
10111111111111111111111110110111011011000000000000
10111111111111111111111110111101111011110100000000
10011111110111111111111110110111011111110010100000
00011111110111101011011010011101011110110000000000
00000011010010101011000010000000000000100000000000
Best eval score so far: 0.8191 @ step 158000 epoch 12.87
loss: 0.107582, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000193
loss: 0.126471, lagrangian_loss: 0.000828, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 04:31:02
Evaluating: accuracy: 0.8196, eval_loss: 0.61, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.5027, expected_sequence_sparsity: 0.8514, target_sparsity: 0.5, step: 164000
lambda_1: -1.4176, lambda_2: 922.4958
lambda_3: 0.0000
train remain: [0.98 0.98 0.99 0.64 0.6 0.65 0.74 0.71 0.54 0.2 ]
infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.07, 0.01]
10111111111111111111111111111111111111111111111110
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111110
11111111111111111111111111111101001000000000000000
11111111111111111111111110111100000010000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101111011110010000000
10011111110111111111111111110111011111110000100000
00011111110111101011011010011101011110110000000000
10000011010010101010000010000000000001000000000000
Best eval score so far: 0.8191 @ step 158000 epoch 12.87
Saving the best model so far: [Epoch 13 | Step: 164000 | MACs sparsity: 0.5203 | Score: 0.8196 | Loss: 0.61]
loss: 0.079169, lagrangian_loss: -0.000495, attention_score_distillation_loss: 0.000195
loss: 0.082118, lagrangian_loss: 0.003527, attention_score_distillation_loss: 0.000191
----------------------------------------------------------------------
time: 2023-07-20 04:41:43
Evaluating: accuracy: 0.8159, eval_loss: 0.6117, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4878, expected_sequence_sparsity: 0.8469, target_sparsity: 0.5, step: 166000
lambda_1: -1.1462, lambda_2: 933.5302
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.64 0.74 0.71 0.54 0.2 ]
infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.18, 0.13, 0.07, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001000000000010000
11111111111111111111111110110100000011000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101111011111000000000
10011111110111111111111110110111011111111000100000
00011111110111101011011010001101011110110000100000
00000011010011101010000010000000010000000000000000
Best eval score so far: 0.8196 @ step 164000 epoch 13.36
loss: 0.172009, lagrangian_loss: 0.005737, attention_score_distillation_loss: 0.000197
loss: 0.092157, lagrangian_loss: 0.002694, attention_score_distillation_loss: 0.000196
----------------------------------------------------------------------
time: 2023-07-20 04:52:07
Evaluating: accuracy: 0.822, eval_loss: 0.5904, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4878, expected_sequence_sparsity: 0.8469, target_sparsity: 0.5, step: 168000
lambda_1: -0.2993, lambda_2: 944.8896
lambda_3: 0.0000
train remain: [0.98 0.98 0.98 0.64 0.6 0.64 0.74 0.71 0.54 0.2 ]
infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.18, 0.13, 0.07, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111110
11111111111111111111111110111101001010000000000000
11111111111111111111111110111100000010000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111111111011110000000000
10011111110111111111111110110111011111110001100000
10011111110111101011011010001101011110110000000000
00000011110010101010000010000000000000010000000000
Best eval score so far: 0.8196 @ step 164000 epoch 13.36
Saving the best model so far: [Epoch 13 | Step: 168000 | MACs sparsity: 0.501 | Score: 0.822 | Loss: 0.5904]
loss: 0.106942, lagrangian_loss: 0.001386, attention_score_distillation_loss: 0.000194
loss: 0.119480, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000190
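By this point the saved "best" checkpoint has moved 152000 -> 158000 -> 164000 -> 168000 as accuracy recovers at a roughly constant ~0.5 MACs sparsity. Given a scraped record list (see the parser sketch earlier), picking the final checkpoint reduces to a one-liner; the sample tuples below are (step, accuracy, macs_sparsity) values copied from the log:

records = [
    (152000, 0.8159, 0.5139),
    (158000, 0.8191, 0.5139),
    (164000, 0.8196, 0.5203),
    (168000, 0.8220, 0.5010),
]
best = max((r for r in records if r[2] >= 0.50), key=lambda r: r[1])
print(best)  # (168000, 0.822, 0.501)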
---------------------------------------------------------------------- time: 2023-07-20 05:02:46 Evaluating: accuracy: 0.8232, eval_loss: 0.6118, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4878, expected_sequence_sparsity: 0.8469, target_sparsity: 0.5, step: 170000 lambda_1: -1.4184, lambda_2: 956.2252 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.73 0.71 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.18, 0.13, 0.07, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000001000000 11111111111111111111111111110100000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111111110000000000 10011111110111111111111110110111011111110100100000 10011111110111101011011010001101011110110000000000 10000011010010101010000010000100000000000000000000 Best eval score so far: 0.8220 @ step 168000 epoch 13.69 Saving the best model so far: [Epoch 13 | Step: 170000 | MACs sparsity: 0.501 | Score: 0.8232 | Loss: 0.6118] loss: 0.093803, lagrangian_loss: -0.000526, attention_score_distillation_loss: 0.000196 loss: 0.321053, lagrangian_loss: 0.001044, attention_score_distillation_loss: 0.000196 ETA: 1 day, 3:23:22 | Epoch 13 finished. Took 3865.47 seconds. ---------------------------------------------------------------------- time: 2023-07-20 05:13:26 Evaluating: accuracy: 0.8232, eval_loss: 0.6061, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.5027, expected_sequence_sparsity: 0.8514, target_sparsity: 0.5, step: 172000 lambda_1: -1.3463, lambda_2: 967.1171 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.73 0.71 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.07, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000001000 11111111111111111111111110110100001010000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111011110000100000 10011111110111111111111110110111011111110000101000 10011111110111101011011010001101011110110000000000 00000011010110101010000010000100000000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.088375, lagrangian_loss: 0.000597, attention_score_distillation_loss: 0.000194 loss: 0.062185, lagrangian_loss: -0.000136, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 05:23:54 Evaluating: accuracy: 0.8156, eval_loss: 0.5917, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.5027, expected_sequence_sparsity: 0.8514, target_sparsity: 0.5, step: 174000 lambda_1: -0.6487, lambda_2: 977.9250 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.74 0.71 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.72, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.07, 0.01] 
10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111111001000000000000000 11111111111111111111111110110100000010000000100000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111011110000100000 10011111110111111111111110110111111111110000100000 00011111110111101011011010001101011110110000100000 10000011010010101011000010000000000000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.069450, lagrangian_loss: 0.000234, attention_score_distillation_loss: 0.000197 loss: 0.120219, lagrangian_loss: 0.003719, attention_score_distillation_loss: 0.000188 ---------------------------------------------------------------------- time: 2023-07-20 05:34:20 Evaluating: accuracy: 0.8128, eval_loss: 0.6078, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.5032, expected_sequence_sparsity: 0.8516, target_sparsity: 0.5, step: 176000 lambda_1: -0.6444, lambda_2: 989.9530 lambda_3: 0.0000 train remain: [0.97 0.99 0.99 0.64 0.6 0.64 0.74 0.71 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.06, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000100 11111111111111111111111110110100010010000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111111111011110000000000 10011111110111111111111110110111011111110000100000 00011111110111101011011010001101011110110000100000 00000011010010101010000010001000000001000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.293445, lagrangian_loss: 0.012867, attention_score_distillation_loss: 0.000190 loss: 0.060773, lagrangian_loss: 0.000343, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 05:44:42 Evaluating: accuracy: 0.8062, eval_loss: 0.6439, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.5032, expected_sequence_sparsity: 0.8516, target_sparsity: 0.5, step: 178000 lambda_1: -0.9454, lambda_2: 1001.3156 lambda_3: 0.0000 train remain: [0.98 0.99 0.99 0.64 0.6 0.64 0.73 0.69 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.06, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111111111101001000000000000000 11111111111111111111111110111100000010000000000000 10111111111111111111111110110101111010000000000000 10111111111111111111111110111101111011110010000000 10011111110111111111111110110111011111110000100000 00011111110111101011011010001101011110110000100000 00000011110010101010000010000000000001000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.079491, lagrangian_loss: 0.002769, attention_score_distillation_loss: 0.000194 loss: 0.052623, lagrangian_loss: -0.000272, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 
2023-07-20 05:55:12 Evaluating: accuracy: 0.8206, eval_loss: 0.6206, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4882, expected_sequence_sparsity: 0.847, target_sparsity: 0.5, step: 180000 lambda_1: -0.8973, lambda_2: 1012.7067 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.73 0.69 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.18, 0.12, 0.07, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001010000000000000 11111111111111111111111110110100000010000000010000 10111111111111111111111110110101011010000001000000 10111111111111111111111110111111111011110000000000 10011111110111111111011110110111011111110010100000 00011111110111101011011010001101011110110000010000 10000011010010101010000010000000010000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.064206, lagrangian_loss: 0.000508, attention_score_distillation_loss: 0.000195 loss: 0.166656, lagrangian_loss: 0.007333, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 06:05:42 Evaluating: accuracy: 0.8087, eval_loss: 0.6297, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4882, expected_sequence_sparsity: 0.847, target_sparsity: 0.5, step: 182000 lambda_1: -0.9319, lambda_2: 1023.9681 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.73 0.69 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.74, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.18, 0.12, 0.07, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000001000000 11111111111111111111111110110101000010000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101111011110100000000 10011111110111111111011110110111011111110000110000 00011111110111101011011010011101011110110000000000 00000011010010101010000010001000010000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.056404, lagrangian_loss: -0.000182, attention_score_distillation_loss: 0.000195 loss: 0.177827, lagrangian_loss: 0.000525, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 06:16:08 Evaluating: accuracy: 0.8148, eval_loss: 0.5833, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.504, expected_sequence_sparsity: 0.8518, target_sparsity: 0.5, step: 184000 lambda_1: -1.5535, lambda_2: 1035.2235 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.6 0.64 0.73 0.69 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.72, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.06, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000001000000000 11111111111111111111111110110100000010000000000001 
10111111111111111111111110110111011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110111111011111110000100000 00011111110111101011011010001101011110110000100000 00000011010010101010000010000000000101000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.061356, lagrangian_loss: -0.000560, attention_score_distillation_loss: 0.000197 ETA: 1 day, 2:22:56 | Epoch 14 finished. Took 3892.12 seconds. loss: 0.117819, lagrangian_loss: 0.014614, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 06:26:36 Evaluating: accuracy: 0.814, eval_loss: 0.5912, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 0.4891, expected_sequence_sparsity: 0.8473, target_sparsity: 0.5, step: 186000 lambda_1: -2.8309, lambda_2: 1046.6055 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.59 0.64 0.73 0.69 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.72, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.17, 0.12, 0.07, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101011000000000000000 11111111111111111111111110110100000010010000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110010100000 00011111110111101011011010001101011110110000010000 00000011010010101010000010000100000000010000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.161097, lagrangian_loss: 0.009394, attention_score_distillation_loss: 0.000189 loss: 0.251886, lagrangian_loss: 0.003215, attention_score_distillation_loss: 0.000190 ---------------------------------------------------------------------- time: 2023-07-20 06:37:00 Evaluating: accuracy: 0.8134, eval_loss: 0.6052, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.504, expected_sequence_sparsity: 0.8518, target_sparsity: 0.5, step: 188000 lambda_1: -0.5550, lambda_2: 1058.0601 lambda_3: 0.0000 train remain: [0.98 0.99 0.98 0.64 0.59 0.64 0.73 0.69 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.6, 0.64, 0.72, 0.7, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.36, 0.23, 0.17, 0.12, 0.06, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001100000000000000 11111111111111111111111110110100010010000000000000 10111111111111111111111110110101011010000000000100 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111111000100000 00011111110111101011011010001101011110110000100000 00000011010010101010000010000001010000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.073433, lagrangian_loss: 0.000835, attention_score_distillation_loss: 0.000192 loss: 0.073502, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 06:47:25 Evaluating: accuracy: 0.8213, eval_loss: 0.614, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.501, expected_sparsity: 
0.4896, expected_sequence_sparsity: 0.8474, target_sparsity: 0.5, step: 190000 lambda_1: -0.6394, lambda_2: 1069.5061 lambda_3: 0.0000 train remain: [0.98 0.99 0.99 0.64 0.59 0.64 0.73 0.69 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.6, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.38, 0.24, 0.17, 0.12, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000010000000000 11111111111111111111111110111100000010000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 00011111110111101011011010001101011110110010000000 10000011010010101010000010000100000000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.092523, lagrangian_loss: 0.000144, attention_score_distillation_loss: 0.000193 loss: 0.115155, lagrangian_loss: 0.000308, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-20 06:57:39 Evaluating: accuracy: 0.8133, eval_loss: 0.6212, token_prune_loc: [True, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5203, expected_sparsity: 0.507, expected_sequence_sparsity: 0.8527, target_sparsity: 0.5, step: 192000 lambda_1: -0.7723, lambda_2: 1080.7791 lambda_3: 0.0000 train remain: [0.98 0.99 0.99 0.64 0.59 0.64 0.73 0.69 0.54 0.2 ] infer remain: [0.96, 1.0, 0.98, 0.64, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 0.96, 0.96, 0.94, 0.6, 0.35, 0.22, 0.16, 0.11, 0.06, 0.01] 10111111111111111111111111111111111111111111111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001010000000000000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110101111010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 10011111110111101011011010001101011110110000000000 00000011010010101010000010001000000000000100000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.164227, lagrangian_loss: -0.000135, attention_score_distillation_loss: 0.000193 loss: 0.064770, lagrangian_loss: 0.009877, attention_score_distillation_loss: 0.000184 ---------------------------------------------------------------------- time: 2023-07-20 07:07:57 Evaluating: accuracy: 0.8194, eval_loss: 0.5947, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4922, expected_sequence_sparsity: 0.8482, target_sparsity: 0.5, step: 194000 lambda_1: -0.8365, lambda_2: 1092.6553 lambda_3: 0.0000 train remain: [0.98 0.99 0.99 0.64 0.58 0.64 0.72 0.68 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.36, 0.23, 0.17, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001100000000000000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 
00011111110111101011011010011101011110110000000000 00000011010010101010000010000100000001000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 loss: 0.053766, lagrangian_loss: 0.000240, attention_score_distillation_loss: 0.000196 loss: 0.132860, lagrangian_loss: 0.001597, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 07:18:22 Evaluating: accuracy: 0.8256, eval_loss: 0.5679, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4922, expected_sequence_sparsity: 0.8482, target_sparsity: 0.5, step: 196000 lambda_1: -1.0598, lambda_2: 1103.7136 lambda_3: 0.0000 train remain: [0.98 1. 0.99 0.63 0.58 0.64 0.72 0.67 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.64, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.63, 0.36, 0.23, 0.17, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000001000000000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 00011111110111101011011010011101011110110000000000 00000011010010101011000010000100000000000000000000 Best eval score so far: 0.8232 @ step 170000 epoch 13.85 Saving the best model so far: [Epoch 15 | Step: 196000 | MACs sparsity: 0.5052 | Score: 0.8256 | Loss: 0.5679] loss: 0.068182, lagrangian_loss: 0.002431, attention_score_distillation_loss: 0.000193 ETA: 1 day, 1:20:07 | Epoch 15 finished. Took 3819.22 seconds. loss: 0.255298, lagrangian_loss: 0.004933, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 07:28:51 Evaluating: accuracy: 0.8259, eval_loss: 0.5736, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 198000 lambda_1: -0.5225, lambda_2: 1115.0232 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.63 0.58 0.64 0.71 0.67 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 00011111110111101011011010011101011110110000000000 00000011010010101011000010000000010000000000000000 Best eval score so far: 0.8256 @ step 196000 epoch 15.97 Saving the best model so far: [Epoch 16 | Step: 198000 | MACs sparsity: 0.5052 | Score: 0.8259 | Loss: 0.5736] loss: 0.123095, lagrangian_loss: 0.000395, attention_score_distillation_loss: 0.000196 loss: 0.283182, lagrangian_loss: 0.000178, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 07:39:14 Evaluating: accuracy: 0.8108, eval_loss: 0.6483, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.4967, expected_sparsity: 0.488, expected_sequence_sparsity: 0.8469, target_sparsity: 0.5, step: 200000 lambda_1: -0.4011, lambda_2: 1125.8365 lambda_3: 0.0000 train remain: [0.98 1. 0.99 0.63 0.58 0.64 0.72 0.68 0.54 0.2 ] infer remain: [1.0, 1.0, 1.0, 0.64, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.64, 0.37, 0.24, 0.17, 0.12, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001010000000000000 11111111111111111111111110111000000010000000000000 11111111111111111111111110110101011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011111110110111011111110000100000 00011111110111101011011010011101011110110000000000 00000011010010101010000010000100000100000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.128929, lagrangian_loss: 0.013007, attention_score_distillation_loss: 0.000188 loss: 0.071725, lagrangian_loss: 0.002999, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 07:49:29 Evaluating: accuracy: 0.8187, eval_loss: 0.5837, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 202000 lambda_1: -0.3883, lambda_2: 1137.2180 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.63 0.58 0.64 0.71 0.67 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000010010000000000000 11111111111111111111111110110101011010000000000000 10111111111111111111111110111101111001110000100000 10111111110111111011011110110111011111110000100000 00011111110111101011011010001101011110110010000000 00000011010010101010000010000000000001010000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.091579, lagrangian_loss: 0.003675, attention_score_distillation_loss: 0.000188 loss: 0.060476, lagrangian_loss: -0.000061, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 07:59:46 Evaluating: accuracy: 0.8159, eval_loss: 0.6066, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 204000 lambda_1: -0.7667, lambda_2: 1148.4537 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.58 0.64 0.71 0.67 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000010000000000100 10111111111111111111111110110101011010000000100000 10111111111111111111111110111101111001110000010000 10011111110111111011011110111111011111110000100000 00011111110111101011011010001101011110111000000000 00000011110010101010000010000000000001000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.075879, lagrangian_loss: 0.002662, attention_score_distillation_loss: 0.000191 loss: 0.175874, lagrangian_loss: 0.001013, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 08:10:05 Evaluating: accuracy: 0.8229, eval_loss: 0.5866, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 206000 lambda_1: -0.3566, lambda_2: 1159.9722 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.58 0.64 0.71 0.67 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110001000010000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111011110000000000 10111111110111111011011110110111011111110000100000 00011111110111101011011010001101011110110010000000 10000011010010101010000010000100000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.541456, lagrangian_loss: 0.001812, attention_score_distillation_loss: 0.000192 loss: 0.077089, lagrangian_loss: 0.000429, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 08:20:25 Evaluating: accuracy: 0.8216, eval_loss: 0.5626, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 208000 lambda_1: -0.5678, lambda_2: 1171.5178 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.58 0.64 0.71 0.68 0.54 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000010010000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110100000000 10011111110111111011011110111111011111110000100000 00011111110111101011011010001101011110110000100000 00000011010010101010000010000100000000000000100000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.100864, lagrangian_loss: 0.015342, attention_score_distillation_loss: 0.000198 ETA: 1 day, 0:16:35 | Epoch 16 finished. Took 3792.07 seconds. loss: 0.092718, lagrangian_loss: 0.001235, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 08:30:44 Evaluating: accuracy: 0.816, eval_loss: 0.5923, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4962, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 210000 lambda_1: -0.7263, lambda_2: 1182.6527 lambda_3: 0.0000 train remain: [0.99 1. 
0.98 0.62 0.58 0.64 0.71 0.68 0.53 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.54, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001000000010000000 11111111111111111111111110110000000010000010000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000100000 10011111110111111011011110111111011111110000100000 00011111110111101011011010011101011110110000000000 00000011010010101010000010000100010000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.109956, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000193 loss: 0.106283, lagrangian_loss: -0.000058, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 08:41:01 Evaluating: accuracy: 0.8229, eval_loss: 0.5717, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4965, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 212000 lambda_1: -0.5911, lambda_2: 1193.4750 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.58 0.64 0.71 0.67 0.53 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001001000000000000 11111111111111111111111110111000000010000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110100000000 10011111110111111011011110110111011111110000110000 00011111110111101011011010001101011110110000000000 00000011010010101010000010000000010001000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.062140, lagrangian_loss: -0.000052, attention_score_distillation_loss: 0.000196 loss: 0.106864, lagrangian_loss: 0.003207, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 08:51:15 Evaluating: accuracy: 0.8166, eval_loss: 0.6007, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4965, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 214000 lambda_1: -0.3847, lambda_2: 1204.2765 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.58 0.64 0.72 0.67 0.53 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001000100000000000 11111111111111111111111110110000000010000000001000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011111110010100000 00011111110111101011011010001101011110110000000000 00000011010010101011000010000001000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.056642, lagrangian_loss: 0.001188, attention_score_distillation_loss: 0.000195 loss: 0.064658, lagrangian_loss: 0.000036, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 09:01:27 Evaluating: accuracy: 0.8157, eval_loss: 0.5965, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4965, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 216000 lambda_1: -0.8640, lambda_2: 1216.4698 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.58 0.64 0.72 0.67 0.52 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000010010000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110001000000 10011111110111111011011110111111011111110000100000 00011111110111101011011010001101011110110000000000 00000011010010101010000010000000000000000000000011 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.122588, lagrangian_loss: 0.000722, attention_score_distillation_loss: 0.000193 loss: 0.131130, lagrangian_loss: 0.001744, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 09:11:38 Evaluating: accuracy: 0.8244, eval_loss: 0.5754, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4965, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 218000 lambda_1: -0.4841, lambda_2: 1227.2792 lambda_3: 0.0000 train remain: [0.99 1. 
0.98 0.62 0.58 0.64 0.72 0.67 0.52 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001000000000010000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111001110001000000 10011111110111111011011110111111011111110000100000 10011101110111101011011010001101011110110000000000 10000011010010101010000010000100000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.054384, lagrangian_loss: 0.004471, attention_score_distillation_loss: 0.000196 loss: 0.055418, lagrangian_loss: 0.000113, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 09:21:53 Evaluating: accuracy: 0.8231, eval_loss: 0.5853, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4965, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 220000 lambda_1: -0.4076, lambda_2: 1238.6625 lambda_3: 0.0000 train remain: [0.99 1. 0.98 0.62 0.58 0.64 0.72 0.67 0.52 0.2 ] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111110001000000000000000 11111111111111111111111110110100000010000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111001110001000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010011101011110110000000000 10000011010010101010000010000100000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.169557, lagrangian_loss: 0.001396, attention_score_distillation_loss: 0.000196 ETA: 23:12:31 | Epoch 17 finished. Took 3763.03 seconds. loss: 0.065390, lagrangian_loss: 0.002043, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 09:32:11 Evaluating: accuracy: 0.8185, eval_loss: 0.5646, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4966, expected_sequence_sparsity: 0.8496, target_sparsity: 0.5, step: 222000 lambda_1: -0.7090, lambda_2: 1250.0664 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.58 0.64 0.71 0.67 0.52 0.19] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111110001000000000000000 11111111111111111111111110110000000110000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111111111001110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011111110000000000 00000011010010101011000010000000000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.226622, lagrangian_loss: -0.000025, attention_score_distillation_loss: 0.000196 loss: 0.098767, lagrangian_loss: 0.000546, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 09:42:28 Evaluating: accuracy: 0.8256, eval_loss: 0.5635, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4966, expected_sequence_sparsity: 0.8496, target_sparsity: 0.5, step: 224000 lambda_1: -0.3564, lambda_2: 1260.8536 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.57 0.64 0.72 0.67 0.52 0.18] infer remain: [1.0, 1.0, 0.98, 0.62, 0.58, 0.64, 0.72, 0.68, 0.52, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.35, 0.23, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001010000000000000 11111111111111111111111110110000010010000000000000 10111111111111111111111110110101011010000000001000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011111110010100000 00011101110111101011011010001101011111110000000000 10000011010010101010000010000000000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.044794, lagrangian_loss: 0.000047, attention_score_distillation_loss: 0.000191 loss: 0.089347, lagrangian_loss: 0.002005, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 09:52:45 Evaluating: accuracy: 0.819, eval_loss: 0.6067, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5074, expected_sparsity: 0.4992, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 226000 lambda_1: -0.2788, lambda_2: 1272.3175 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.57 0.64 0.72 0.68 0.52 0.18] infer remain: [1.0, 1.0, 0.98, 0.62, 0.56, 0.64, 0.72, 0.68, 0.52, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.34, 0.22, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001000000000100000 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011010000001000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011111110010100000 00011101110111101011011010011101011110110000000000 00000011010010101000000010001100000000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.057331, lagrangian_loss: 0.000363, attention_score_distillation_loss: 0.000196 loss: 0.236378, lagrangian_loss: 0.001163, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 10:02:55 Evaluating: accuracy: 0.8179, eval_loss: 0.5801, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5074, expected_sparsity: 0.4992, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 228000 lambda_1: -0.5084, lambda_2: 1283.6144 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.57 0.64 0.72 0.67 0.51 0.18] infer remain: [1.0, 1.0, 0.98, 0.62, 0.56, 0.64, 0.72, 0.68, 0.52, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.34, 0.22, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001111000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011110110010000000 00000011010010101000000010000000000000010100000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 loss: 0.068802, lagrangian_loss: 0.008518, attention_score_distillation_loss: 0.000191 loss: 0.085599, lagrangian_loss: 0.000146, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 10:13:09 Evaluating: accuracy: 0.8264, eval_loss: 0.5647, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5074, expected_sparsity: 0.4992, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 230000 lambda_1: -0.3656, lambda_2: 1295.0739 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.72 0.67 0.51 0.18] infer remain: [1.0, 1.0, 0.98, 0.62, 0.56, 0.64, 0.72, 0.68, 0.52, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.98, 0.61, 0.34, 0.22, 0.16, 0.11, 0.06, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111110 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011011110110111011111110010100000 00011101110111101011011010001101011111110000000000 00000011010010101001000010000000010000000000000000 Best eval score so far: 0.8259 @ step 198000 epoch 16.13 Saving the best model so far: [Epoch 18 | Step: 230000 | MACs sparsity: 0.5074 | Score: 0.8264 | Loss: 0.5647] loss: 0.110046, lagrangian_loss: 0.004896, attention_score_distillation_loss: 0.000193 loss: 0.123305, lagrangian_loss: 0.000625, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 10:23:29 Evaluating: accuracy: 0.8233, eval_loss: 0.5836, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4953, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 232000 lambda_1: -0.4656, lambda_2: 1305.9414 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.72 0.67 0.51 0.18] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000010000 10011111110111111011011110110111011111110001100000 00011101110111101011011010001101011110110000000000 10000011010010101000000010000100000000000000000000 Best eval score so far: 0.8264 @ step 230000 epoch 18.74 loss: 0.050727, lagrangian_loss: 0.000948, attention_score_distillation_loss: 0.000193 loss: 0.049199, lagrangian_loss: 0.000723, attention_score_distillation_loss: 0.000193 ETA: 22:08:42 | Epoch 18 finished. Took 3770.11 seconds. ---------------------------------------------------------------------- time: 2023-07-20 10:33:44 Evaluating: accuracy: 0.8184, eval_loss: 0.5883, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4953, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 234000 lambda_1: -0.3653, lambda_2: 1317.1113 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.72 0.67 0.51 0.18] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000100000000 11111111111111111111111110110000000000000001000000 10111111111111111111111110110101011010010000000000 10111111111111111111111110111101111101110000000000 10011111110111111011111110110111011111110000100000 00011101110111101011011010001101011110110000000000 00000011010010101001000010000000000001000000000000 Best eval score so far: 0.8264 @ step 230000 epoch 18.74 loss: 0.062990, lagrangian_loss: 0.003989, attention_score_distillation_loss: 0.000196 loss: 0.057639, lagrangian_loss: 0.001484, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 10:43:56 Evaluating: accuracy: 0.8272, eval_loss: 0.5673, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4953, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 236000 lambda_1: -0.4041, lambda_2: 1328.0919 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.56 0.64 0.72 0.67 0.5 0.18] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110001000000000000000000 10111111111111111111111110110101011010000001000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110001100000 00011101110111101011011010001101011100110010000000 00000011110010101000001010000000000000000000000000 Best eval score so far: 0.8264 @ step 230000 epoch 18.74 Saving the best model so far: [Epoch 19 | Step: 236000 | MACs sparsity: 0.5052 | Score: 0.8272 | Loss: 0.5673] loss: 0.057882, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000193 loss: 0.043422, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 10:54:26 Evaluating: accuracy: 0.8205, eval_loss: 0.6014, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4953, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 238000 lambda_1: -0.3427, lambda_2: 1339.0387 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.56 0.64 0.72 0.67 0.5 0.18] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110000110000 00011101110111101011011010011101011100110000000000 00000011010010101010000010000000000001000000000000 Best eval score so far: 0.8272 @ step 236000 epoch 19.23 loss: 0.049152, lagrangian_loss: 0.004928, attention_score_distillation_loss: 0.000196 loss: 0.096368, lagrangian_loss: 0.003404, attention_score_distillation_loss: 0.000190 ---------------------------------------------------------------------- time: 2023-07-20 11:04:36 Evaluating: accuracy: 0.8264, eval_loss: 0.5758, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4953, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 240000 lambda_1: -0.3389, lambda_2: 1349.7075 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.56 0.64 0.72 0.67 0.51 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000001000000 11111111111111111111111110110000000000010000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110010000000 10000011010010101000000010000000000100000000000000 Best eval score so far: 0.8272 @ step 236000 epoch 19.23 loss: 0.137247, lagrangian_loss: 0.006938, attention_score_distillation_loss: 0.000195 loss: 0.060892, lagrangian_loss: 0.000544, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 11:15:01 Evaluating: accuracy: 0.8267, eval_loss: 0.5658, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 242000 lambda_1: -0.2928, lambda_2: 1361.7422 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.62 0.56 0.64 0.72 0.67 0.51 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000010000 11111111111111111111111110110000000000001000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110001100000 00011101110111101011011010001101011100110000100000 00000011010010101000010010000000000000000000000000 Best eval score so far: 0.8272 @ step 236000 epoch 19.23 loss: 0.086724, lagrangian_loss: 0.000374, attention_score_distillation_loss: 0.000197 loss: 0.035307, lagrangian_loss: 0.000574, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-20 11:25:23 Evaluating: accuracy: 0.8251, eval_loss: 0.5631, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 244000 lambda_1: -0.2695, lambda_2: 1373.0393 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.56 0.64 0.72 0.68 0.51 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000001000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111111000100000 00011101110111101011011010001101011100110000100000 00000011010010101000000010000000000001000000000000 Best eval score so far: 0.8272 @ step 236000 epoch 19.23 loss: 0.033806, lagrangian_loss: 0.007750, attention_score_distillation_loss: 0.000197 loss: 0.029448, lagrangian_loss: 0.030912, attention_score_distillation_loss: 0.000192 ETA: 21:05:26 | Epoch 19 finished. Took 3796.75 seconds. ---------------------------------------------------------------------- time: 2023-07-20 11:35:48 Evaluating: accuracy: 0.8315, eval_loss: 0.5576, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 246000 lambda_1: -0.2077, lambda_2: 1383.8309 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.62 0.56 0.64 0.72 0.68 0.51 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011111110110111011111110000100000 00011101110111101011011010001101011101110000000000 00000011010010101000000010000000000001000000000000 Best eval score so far: 0.8272 @ step 236000 epoch 19.23 Saving the best model so far: [Epoch 20 | Step: 246000 | MACs sparsity: 0.5052 | Score: 0.8315 | Loss: 0.5576] loss: 0.057055, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000192 loss: 0.214479, lagrangian_loss: 0.002307, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 11:46:21 Evaluating: accuracy: 0.8232, eval_loss: 0.5844, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 248000 lambda_1: -0.5321, lambda_2: 1395.1676 lambda_3: 0.0000 train remain: [0.99 1. 1. 0.62 0.56 0.64 0.72 0.68 0.51 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110010100000 00011101110111101011011010011101011100110000000000 00000011010010101000000010000000000000010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.056983, lagrangian_loss: 0.005189, attention_score_distillation_loss: 0.000197 loss: 0.026325, lagrangian_loss: 0.004138, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-20 11:56:48 Evaluating: accuracy: 0.8281, eval_loss: 0.5712, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 250000 lambda_1: -0.3664, lambda_2: 1406.3376 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.62 0.56 0.64 0.72 0.68 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000010000 11111111111111111111111110110001000000000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110100100000 00011101110111101011011010001101011100110010000000 10000010010010101000000010000000000000000100000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.047812, lagrangian_loss: 0.000703, attention_score_distillation_loss: 0.000189 loss: 0.073155, lagrangian_loss: 0.024778, attention_score_distillation_loss: 0.000189 ---------------------------------------------------------------------- time: 2023-07-20 12:07:05 Evaluating: accuracy: 0.8296, eval_loss: 0.5787, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 252000 lambda_1: -0.3089, lambda_2: 1417.6064 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.56 0.64 0.72 0.67 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000000100000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110010000000 00000010010010101000000010000100010000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.085206, lagrangian_loss: 0.002967, attention_score_distillation_loss: 0.000196 loss: 0.053493, lagrangian_loss: 0.000025, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 12:17:21 Evaluating: accuracy: 0.8257, eval_loss: 0.5883, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 254000 lambda_1: -0.3837, lambda_2: 1428.8234 lambda_3: 0.0000 train remain: [0.99 1. 
0.99 0.62 0.56 0.64 0.72 0.67 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011010000010000000 10111111111111111111111110111101111001110000100000 10011111110111111011011110110111011111110010100000 00011101110111101011011010001101011100110000010000 00000010110010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.168770, lagrangian_loss: 0.001697, attention_score_distillation_loss: 0.000190 loss: 0.242238, lagrangian_loss: 0.000332, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 12:27:44 Evaluating: accuracy: 0.8191, eval_loss: 0.5779, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 256000 lambda_1: -0.3410, lambda_2: 1440.1693 lambda_3: 0.0000 train remain: [0.99 1. 0.99 0.62 0.56 0.64 0.72 0.67 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000010 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011010000010000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110000100000 00000010010010101000000010000000010001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.067080, lagrangian_loss: 0.000779, attention_score_distillation_loss: 0.000197 loss: 0.029442, lagrangian_loss: 0.002538, attention_score_distillation_loss: 0.000190 ETA: 20:02:33 | Epoch 20 finished. Took 3822.05 seconds. ---------------------------------------------------------------------- time: 2023-07-20 12:38:10 Evaluating: accuracy: 0.8229, eval_loss: 0.5666, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 258000 lambda_1: -0.2879, lambda_2: 1451.4930 lambda_3: 0.0000 train remain: [0.99 1. 1. 
0.62 0.56 0.64 0.71 0.67 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110110000100000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110000100000 00000010110010101000000010000000010000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.126934, lagrangian_loss: 0.004724, attention_score_distillation_loss: 0.000195 loss: 0.038309, lagrangian_loss: 0.002866, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 12:48:31 Evaluating: accuracy: 0.821, eval_loss: 0.6035, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 260000 lambda_1: -0.4839, lambda_2: 1462.4437 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.67 0.5 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111110001000000000000000 11111111111111111111111110110001000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011110110000000000 00000010010010101001000010010000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.048121, lagrangian_loss: 0.006260, attention_score_distillation_loss: 0.000197 loss: 0.188728, lagrangian_loss: 0.003054, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 12:58:49 Evaluating: accuracy: 0.8266, eval_loss: 0.5843, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4955, expected_sequence_sparsity: 0.8492, target_sparsity: 0.5, step: 262000 lambda_1: -0.5873, lambda_2: 1473.8583 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.67 0.49 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.5, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110010000000 00000010010010101001000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.042839, lagrangian_loss: 0.000278, attention_score_distillation_loss: 0.000192 loss: 0.064444, lagrangian_loss: 0.000505, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 13:09:12 Evaluating: accuracy: 0.8286, eval_loss: 0.5681, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4957, expected_sequence_sparsity: 0.8493, target_sparsity: 0.5, step: 264000 lambda_1: -0.2911, lambda_2: 1484.7356 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.67 0.49 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110111111011111110000100000 00011101110111101011011010001101011100110000000000 10000010010010101000000010000000000000010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.126043, lagrangian_loss: 0.000283, attention_score_distillation_loss: 0.000192 loss: 0.095071, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 13:19:33 Evaluating: accuracy: 0.824, eval_loss: 0.5637, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4957, expected_sequence_sparsity: 0.8493, target_sparsity: 0.5, step: 266000 lambda_1: -0.2519, lambda_2: 1496.0173 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.67 0.49 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000000000000100000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011011110110111111111110000100000 00011101110111101011011010001101011100110000000000 10000010010010101000000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.069300, lagrangian_loss: 0.000108, attention_score_distillation_loss: 0.000197 loss: 0.034914, lagrangian_loss: 0.000220, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 13:29:50 Evaluating: accuracy: 0.8248, eval_loss: 0.5719, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4957, expected_sequence_sparsity: 0.8493, target_sparsity: 0.5, step: 268000 lambda_1: -0.2940, lambda_2: 1507.2919 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.67 0.49 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000000000000010000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111111011110110111011111110000100000 10001101110111101011011010001101011100110000000000 00000010010010101000010010000000000000000010000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.046326, lagrangian_loss: 0.001375, attention_score_distillation_loss: 0.000195 loss: 0.056109, lagrangian_loss: 0.000273, attention_score_distillation_loss: 0.000197 ETA: 18:59:12 | Epoch 21 finished. Took 3793.05 seconds. ---------------------------------------------------------------------- time: 2023-07-20 13:40:06 Evaluating: accuracy: 0.8285, eval_loss: 0.582, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4957, expected_sequence_sparsity: 0.8493, target_sparsity: 0.5, step: 270000 lambda_1: -0.2990, lambda_2: 1519.0188 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.72 0.67 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.68, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000010000000 11111111111111111111111110110000000001000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011011110111111011111110000100000 10001101110111101011011010001101011100110000000000 10000010010010101000000010000000000000010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.046857, lagrangian_loss: 0.000451, attention_score_distillation_loss: 0.000193 loss: 0.042992, lagrangian_loss: 0.009684, attention_score_distillation_loss: 0.000185 ---------------------------------------------------------------------- time: 2023-07-20 13:50:21 Evaluating: accuracy: 0.8237, eval_loss: 0.5705, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 272000 lambda_1: -0.3639, lambda_2: 1530.3840 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.67 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000001 11111111111111111111111110110000000001000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011111110000100000 10001101110111101011011010001101011100110000000000 00000010110010101000000010000000000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.060280, lagrangian_loss: 0.000087, attention_score_distillation_loss: 0.000194 loss: 0.067729, lagrangian_loss: 0.006996, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 14:00:36 Evaluating: accuracy: 0.8222, eval_loss: 0.5778, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 274000 lambda_1: -0.6063, lambda_2: 1541.7102 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.66 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111111110000000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110000100000 00001101110111101011011010011101011100110000000000 00000010010010101000000010000001000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.035328, lagrangian_loss: 0.000687, attention_score_distillation_loss: 0.000195 loss: 0.062138, lagrangian_loss: 0.005707, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 14:10:53 Evaluating: accuracy: 0.8237, eval_loss: 0.602, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 276000 lambda_1: -0.2564, lambda_2: 1553.1981 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.66 0.48 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000010 11111111111111111111111110110000000000000000010000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011111110000100000 00001101110111101011011010001101011100110001000000 00000010010010101000000010000000000000010100000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.075879, lagrangian_loss: 0.000138, attention_score_distillation_loss: 0.000193 loss: 0.056419, lagrangian_loss: 0.000222, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 14:21:08 Evaluating: accuracy: 0.8229, eval_loss: 0.5861, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 278000 lambda_1: -0.2796, lambda_2: 1564.7177 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.66 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000000010000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110000100000 00001101110111101011011010011101011100110000000000 00000010010010101001000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.057947, lagrangian_loss: 0.000264, attention_score_distillation_loss: 0.000192 loss: 0.229743, lagrangian_loss: 0.009653, attention_score_distillation_loss: 0.000198 ---------------------------------------------------------------------- time: 2023-07-20 14:31:22 Evaluating: accuracy: 0.8173, eval_loss: 0.6086, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 280000 lambda_1: -0.4373, lambda_2: 1575.6095 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.65 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000100000 11111111111111111111111110110000000001000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111101110000000000 10011111110111111011011110110111011111110000100000 00001101110111101011011010001101011100110010000000 00000010010010101000000010000100010000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.022800, lagrangian_loss: 0.000021, attention_score_distillation_loss: 0.000195 loss: 0.037542, lagrangian_loss: 0.001691, attention_score_distillation_loss: 0.000187 ---------------------------------------------------------------------- time: 2023-07-20 14:41:35 Evaluating: accuracy: 0.8244, eval_loss: 0.5799, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 282000 lambda_1: -0.4589, lambda_2: 1587.0476 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.65 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000010000 10011111110111111011011110110111011101110000110000 00001101110111101011011010001101011100110001000000 10000010010010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.032944, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000194 ETA: 17:56:10 | Epoch 22 finished. Took 3817.99 seconds. loss: 0.066151, lagrangian_loss: 0.004888, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 14:51:53 Evaluating: accuracy: 0.8228, eval_loss: 0.5892, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4961, expected_sequence_sparsity: 0.8494, target_sparsity: 0.5, step: 284000 lambda_1: -0.3909, lambda_2: 1598.5747 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.65 0.48 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.48, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111110001000000000000000 11111111111111111111111110110000000000000000000100 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000100000 10011111110111111011011110110111011101110100100000 00001101110111101011011010001101011101110000000000 00000010010010101001000010000000000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.028309, lagrangian_loss: 0.011174, attention_score_distillation_loss: 0.000187 loss: 0.049851, lagrangian_loss: 0.001711, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 15:02:09 Evaluating: accuracy: 0.8269, eval_loss: 0.587, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5052, expected_sparsity: 0.4963, expected_sequence_sparsity: 0.8495, target_sparsity: 0.5, step: 286000 lambda_1: -0.2409, lambda_2: 1610.0529 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.72 0.65 0.47 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.66, 0.46, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.11, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000100 11111111111111111111111110110000001000000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111001110000010000 10111111110111111011011110110111011101110000100000 00001101110111101011011010001101011100110000000000 10000010010010101000000010010000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.258007, lagrangian_loss: 0.002403, attention_score_distillation_loss: 0.000196 loss: 0.037979, lagrangian_loss: 0.009439, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 15:12:27 Evaluating: accuracy: 0.8235, eval_loss: 0.5764, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4967, expected_sequence_sparsity: 0.8496, target_sparsity: 0.5, step: 288000 lambda_1: -0.2370, lambda_2: 1621.0662 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.72 0.65 0.47 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.64, 0.46, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110110000010000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110010000000 10011111110111111011011110110111011101110000100000 00001101110111101011011010001101011100110000000000 00000010010010101001000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.050787, lagrangian_loss: 0.001629, attention_score_distillation_loss: 0.000191 loss: 0.039997, lagrangian_loss: 0.003027, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 15:22:37 Evaluating: accuracy: 0.8265, eval_loss: 0.5807, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4967, expected_sequence_sparsity: 0.8496, target_sparsity: 0.5, step: 290000 lambda_1: -0.3344, lambda_2: 1632.5874 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.65 0.46 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.72, 0.64, 0.46, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000001000000000 11111111111111111111111110110000100000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111011110000000000 10011111110111111011011110110111011101110000100000 00001101110111101011011010001101011100110000000000 00000010110010101010000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.065826, lagrangian_loss: 0.000495, attention_score_distillation_loss: 0.000192 loss: 0.078374, lagrangian_loss: 0.001571, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 15:32:50 Evaluating: accuracy: 0.8294, eval_loss: 0.5549, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4974, expected_sequence_sparsity: 0.8498, target_sparsity: 0.5, step: 292000 lambda_1: -0.5335, lambda_2: 1643.5621 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.64 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.46, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.05, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110110000000010000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000100000 00001101110111101011011010001101011100110000000000 00000011010010101000000010000000010000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.100501, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000196 loss: 0.059791, lagrangian_loss: 0.000036, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 15:43:00 Evaluating: accuracy: 0.8263, eval_loss: 0.5767, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 294000 lambda_1: -0.2820, lambda_2: 1654.6136 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.64 0.45 0.17] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000010 11111111111111111111111110110000000010000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000100000 00001101010111101011011010001101011100110000000000 00000010010010101011000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.116059, lagrangian_loss: 0.000229, attention_score_distillation_loss: 0.000192 ETA: 16:52:27 | Epoch 23 finished. Took 3761.81 seconds. loss: 0.056910, lagrangian_loss: 0.010474, attention_score_distillation_loss: 0.000184 ---------------------------------------------------------------------- time: 2023-07-20 15:53:15 Evaluating: accuracy: 0.8263, eval_loss: 0.5648, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 296000 lambda_1: -0.3821, lambda_2: 1665.7172 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.71 0.64 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000001000 11111111111111111111111110110100000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000100000 00001101010111101011011010001101011100110000000000 00000010010010101000000010001000000000000100000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.025767, lagrangian_loss: 0.000779, attention_score_distillation_loss: 0.000193 loss: 0.030930, lagrangian_loss: 0.004002, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 16:03:30 Evaluating: accuracy: 0.82, eval_loss: 0.5809, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 298000 lambda_1: -0.1996, lambda_2: 1677.0504 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.71 0.64 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100101000000000000000 11111111111111111111111110111000000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000100000 00001101010111101011011010001101010100110001000000 00000010010010101000000010000000000001010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.044801, lagrangian_loss: 0.010394, attention_score_distillation_loss: 0.000191 loss: 0.021136, lagrangian_loss: 0.000130, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 16:13:42 Evaluating: accuracy: 0.8212, eval_loss: 0.5672, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 300000 lambda_1: -0.3986, lambda_2: 1688.1934 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.7 0.63 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110000010000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110111111011101110000000000 00001101010111101011011010011101010100110000000000 00000010010010101000000010000000000001010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.063277, lagrangian_loss: 0.000050, attention_score_distillation_loss: 0.000192 loss: 0.024980, lagrangian_loss: 0.000467, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 16:23:57 Evaluating: accuracy: 0.8269, eval_loss: 0.5721, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 302000 lambda_1: -0.2192, lambda_2: 1699.8484 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.64 0.7 0.63 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000010000000 11111111111111111111111110110000000000000000010000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101111001110000000000 10011111110111111111011110110111011101110000000000 00001101010111101011011010011101010100110000000000 10000010110010101000000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.030597, lagrangian_loss: 0.000271, attention_score_distillation_loss: 0.000190 loss: 0.028631, lagrangian_loss: 0.000199, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 16:34:10 Evaluating: accuracy: 0.8214, eval_loss: 0.5919, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 304000 lambda_1: -0.2444, lambda_2: 1710.9690 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.64 0.7 0.63 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000000000100000000000 10111111111111111111111110110101011010100000000000 10111111111111111111111110111101111001110000000000 10011111110111111111011110110111011101110000000000 10001101010111101011011010001101010100110000000000 10000010010010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.066751, lagrangian_loss: 0.003915, attention_score_distillation_loss: 0.000192 loss: 0.061097, lagrangian_loss: 0.000803, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 16:44:23 Evaluating: accuracy: 0.8249, eval_loss: 0.5787, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4976, expected_sequence_sparsity: 0.8499, target_sparsity: 0.5, step: 306000 lambda_1: -0.2375, lambda_2: 1721.8367 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.63 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.64, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000100 11111111111111111111111110110000000000001000000000 10111111111111111111111110110101011010100000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110001000000 10001101010111101011011010001101010100110000000000 00000010110010101000000010000000000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.024371, lagrangian_loss: 0.000037, attention_score_distillation_loss: 0.000195 ETA: 15:48:46 | Epoch 24 finished. Took 3756.12 seconds. loss: 0.029833, lagrangian_loss: 0.009999, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 16:54:34 Evaluating: accuracy: 0.8231, eval_loss: 0.5742, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 308000 lambda_1: -0.2294, lambda_2: 1733.4829 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.63 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000000100000000000000 10111111111111111111111110110101011010000100000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000000000 00001101010111101011011010011101010100110000000000 00000010110010101000000010000000000000000010000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.030120, lagrangian_loss: 0.003922, attention_score_distillation_loss: 0.000191 loss: 0.160197, lagrangian_loss: 0.008084, attention_score_distillation_loss: 0.000189 ---------------------------------------------------------------------- time: 2023-07-20 17:04:52 Evaluating: accuracy: 0.8276, eval_loss: 0.5766, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 310000 lambda_1: -0.3295, lambda_2: 1744.9882 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.69 0.62 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000001000 11111111111111111111111110110100000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011110110111011101110000000000 00001101010111101011011010001101010101110000000000 00000010010010101001000010000000000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.042829, lagrangian_loss: 0.005697, attention_score_distillation_loss: 0.000190 loss: 0.044764, lagrangian_loss: 0.004333, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 17:15:08 Evaluating: accuracy: 0.8256, eval_loss: 0.576, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 312000 lambda_1: -0.3854, lambda_2: 1756.4161 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.69 0.62 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000001000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011001110100000000 10011111110111111011011110110111011101110000000000 00001101010111101011011010001101010100110010000000 10000010010010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.041220, lagrangian_loss: -0.000018, attention_score_distillation_loss: 0.000194 loss: 0.016757, lagrangian_loss: 0.000640, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 17:25:22 Evaluating: accuracy: 0.8278, eval_loss: 0.562, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 314000 lambda_1: -0.1781, lambda_2: 1768.0020 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.62 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000000000010000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011011110000000000 10011111110111111011011110110111011101110000000000 00001101010111101011011010001101010100110010000000 00000011010010101000000010000001000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.050545, lagrangian_loss: 0.001487, attention_score_distillation_loss: 0.000193 loss: 0.043700, lagrangian_loss: 0.004475, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-20 17:35:35 Evaluating: accuracy: 0.827, eval_loss: 0.5629, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 316000 lambda_1: -0.1148, lambda_2: 1778.9703 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.69 0.62 0.45 0.16] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000100000000 11111111111111111111111110110001000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011001110010000000 10011111110111111011011110111101011101110000000000 00001101010111101011011010001101010100110001000000 00000010010010101000000010000100000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.043729, lagrangian_loss: 0.001906, attention_score_distillation_loss: 0.000195 loss: 0.044431, lagrangian_loss: 0.013187, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-20 17:45:50 Evaluating: accuracy: 0.8274, eval_loss: 0.5622, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.498, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 318000 lambda_1: -0.2308, lambda_2: 1789.8428 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.65 0.69 0.62 0.45 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.16] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111110001000000000000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101111010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110111101011101110000000000 00001101010111101011011010011101010100110000000000 00000010010010101000000010000100000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.103160, lagrangian_loss: 0.024932, attention_score_distillation_loss: 0.000187 loss: 0.065867, lagrangian_loss: 0.009530, attention_score_distillation_loss: 0.000197 ETA: 14:45:15 | Epoch 25 finished. Took 3765.57 seconds. ---------------------------------------------------------------------- time: 2023-07-20 17:56:05 Evaluating: accuracy: 0.8264, eval_loss: 0.5632, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4981, expected_sequence_sparsity: 0.85, target_sparsity: 0.5, step: 320000 lambda_1: -0.1545, lambda_2: 1801.5266 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.61 0.45 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.62, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.1, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000001000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101011010000000010000 10111111111111111111111110111101011001110001000000 10011111110111111011011110110101011101110000100000 00001101010111101011011010001101010100110000100000 00000010010010101000000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.030089, lagrangian_loss: 0.002802, attention_score_distillation_loss: 0.000197 loss: 0.019152, lagrangian_loss: 0.002023, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 18:06:17 Evaluating: accuracy: 0.8279, eval_loss: 0.5726, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 322000 lambda_1: -0.1425, lambda_2: 1812.9133 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.66 0.7 0.61 0.45 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000010000000000000000 10111111111111111111111110110101111010000000000000 10111111111111111111111110111101011011110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010100110010000000 10000010010010101000000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.036913, lagrangian_loss: 0.012088, attention_score_distillation_loss: 0.000198 loss: 0.014206, lagrangian_loss: 0.000608, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 18:16:25 Evaluating: accuracy: 0.8289, eval_loss: 0.5666, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 324000 lambda_1: -0.2789, lambda_2: 1824.3096 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.65 0.7 0.61 0.44 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100101000000000000000 11111111111111111111111111110000000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010100110010000000 00000010010010101000000010000100000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.048748, lagrangian_loss: 0.010818, attention_score_distillation_loss: 0.000188 loss: 0.033394, lagrangian_loss: 0.001252, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-20 18:26:33 Evaluating: accuracy: 0.8291, eval_loss: 0.5768, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 326000 lambda_1: -0.2564, lambda_2: 1835.3079 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.62 0.56 0.65 0.7 0.61 0.44 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110000000001000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010100110010000000 00000010110010101000000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.043938, lagrangian_loss: 0.002200, attention_score_distillation_loss: 0.000188 loss: 0.041082, lagrangian_loss: 0.000048, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-20 18:36:45 Evaluating: accuracy: 0.8305, eval_loss: 0.5705, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 328000 lambda_1: -0.2004, lambda_2: 1846.7994 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.62 0.56 0.65 0.7 0.6 0.44 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001000000000000000 11111111111111111111111111110000000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010100110010000000 00000000110010101000000010000000000000010000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.052269, lagrangian_loss: 0.000947, attention_score_distillation_loss: 0.000193 loss: 0.020724, lagrangian_loss: 0.007215, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-20 18:46:54 Evaluating: accuracy: 0.8238, eval_loss: 0.6002, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 330000 lambda_1: -0.1376, lambda_2: 1858.1134 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.65 0.7 0.6 0.44 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000000000010000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010100110010000000 00000000110010101000000010000000000001000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.057873, lagrangian_loss: 0.001561, attention_score_distillation_loss: 0.000196 loss: 0.021398, lagrangian_loss: 0.006336, attention_score_distillation_loss: 0.000191 ETA: 13:41:38 | Epoch 26 finished. Took 3745.59 seconds. ---------------------------------------------------------------------- time: 2023-07-20 18:57:10 Evaluating: accuracy: 0.8265, eval_loss: 0.5913, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 332000 lambda_1: -0.2884, lambda_2: 1869.6687 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000000000000100000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110110101011101110000000000 00001101010111101011011010001101010101110000000000 10000000010010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.035918, lagrangian_loss: 0.059729, attention_score_distillation_loss: 0.000184 loss: 0.030849, lagrangian_loss: 0.012142, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-20 19:07:16 Evaluating: accuracy: 0.8247, eval_loss: 0.5724, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 334000 lambda_1: -0.2547, lambda_2: 1881.0992 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.65 0.69 0.6 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000100000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110010000000 00001101010111101011011010001101010100110000000000 00000000110010101001000010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.021376, lagrangian_loss: 0.012388, attention_score_distillation_loss: 0.000187 loss: 0.028864, lagrangian_loss: 0.003056, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-20 19:17:32 Evaluating: accuracy: 0.8285, eval_loss: 0.5728, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 336000 lambda_1: -0.2108, lambda_2: 1891.6975 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110110000000010000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000010000 00001101010110101011011010001101010100110010000000 10000000010010101000010010000000000000000000000000 Best eval score so far: 0.8315 @ step 246000 epoch 20.05 loss: 0.041644, lagrangian_loss: 0.004673, attention_score_distillation_loss: 0.000194 loss: 0.030534, lagrangian_loss: 0.000088, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-20 19:27:44 Evaluating: accuracy: 0.8281, eval_loss: 0.5807, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 338000 lambda_1: -0.1054, lambda_2: 1902.4525 lambda_3: 0.0000 train remain: [1. 1. 1. 
----------------------------------------------------------------------
time: 2023-07-20 19:27:44
Evaluating: accuracy: 0.8281, eval_loss: 0.5807, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 338000
lambda_1: -0.1054, lambda_2: 1902.4525 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110000000001000000000000
10111111111111111111111110110101011110000000000000
10111111111111111111111110111111011001110000000000
10011111110111111111011110010101011101110000000000
00001101010110101011011010001101010101110010000000
00000000010010101000000010000100000001000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.037921, lagrangian_loss: 0.003868, attention_score_distillation_loss: 0.000193
loss: 0.023443, lagrangian_loss: 0.008772, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-20 19:37:58
Evaluating: accuracy: 0.8267, eval_loss: 0.5668, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 340000
lambda_1: -0.0772, lambda_2: 1913.5365 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001001000000000000
11111111111111111111111110110000000010000000000000
10111111111111111111111110110101011110000000000000
10111111111111111111111110111101011001110001000000
10011111110111111011011110010101011101110001000000
00001101010110101011011010001101010101110000000000
00000000010010101000000010000100000000010000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.020711, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000192
loss: 0.013291, lagrangian_loss: 0.001456, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 19:48:13
Evaluating: accuracy: 0.8274, eval_loss: 0.5767, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 342000
lambda_1: -0.1734, lambda_2: 1925.3940 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001010000000000000
11111111111111111111111110110000000000000000010000
10111111111111111111111110110101011110000000000000
10111111111111111111111110111101011001110000100000
10011111110111111011011110010101011101110001000000
00001101010110101011011010011101010100110000000000
00000000010010101000000010000100000001000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.035243, lagrangian_loss: 0.000902, attention_score_distillation_loss: 0.000197
loss: 0.036534, lagrangian_loss: 0.006290, attention_score_distillation_loss: 0.000196
ETA: 12:38:04 | Epoch 27 finished. Took 3741.76 seconds.
----------------------------------------------------------------------
time: 2023-07-20 19:58:16
Evaluating: accuracy: 0.8261, eval_loss: 0.5724, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 344000
lambda_1: -0.2415, lambda_2: 1936.3141 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110100000000000000000000
10111111111111111111111110110101011110000000000000
10111111111111111111111110111101011001110010000000
10011111110111111011011110010101011101110001000000
00001101010110101011011010001101010100110010000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.054014, lagrangian_loss: 0.000557, attention_score_distillation_loss: 0.000195
loss: 0.060956, lagrangian_loss: 0.003629, attention_score_distillation_loss: 0.000196
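The steadily climbing lambda_2 (~1870 to ~1940 across these blocks) alongside a lagrangian_loss that hovers near zero and occasionally dips negative is the signature of a CoFi-style Lagrangian sparsity constraint. A sketch under that assumption (the exact formulation is not shown in this excerpt):

import torch

# Sketch of a CoFi-style Lagrangian term: the multipliers are trained to
# maximize the term while the model minimizes it, so lambda_2 ratchets
# upward; the linear lambda_1 term lets the loss go negative when the
# expected sparsity overshoots the target.
lambda_1 = torch.tensor(0.0, requires_grad=True)
lambda_2 = torch.tensor(0.0, requires_grad=True)

def lagrangian_loss(expected_sparsity, target_sparsity=0.5):
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap * gap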
----------------------------------------------------------------------
time: 2023-07-20 20:08:25
Evaluating: accuracy: 0.8263, eval_loss: 0.5752, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 346000
lambda_1: -0.2616, lambda_2: 1947.7352 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111101001000000000000000
11111111111111111111111110110000001000000000000000
10111111111111111111111110110101011110000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110000100000
00001101010110101011011010001101010100110010000000
10000001010010101000000010000000000000000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.026270, lagrangian_loss: 0.006022, attention_score_distillation_loss: 0.000194
loss: 0.037750, lagrangian_loss: 0.002698, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 20:18:36
Evaluating: accuracy: 0.8283, eval_loss: 0.5692, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 348000
lambda_1: -0.1599, lambda_2: 1958.7827 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111101001000000000000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110110101011010000000100000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110100000000
00001101010110101011011010011101010100110000000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
loss: 0.051844, lagrangian_loss: 0.007456, attention_score_distillation_loss: 0.000188
loss: 0.017659, lagrangian_loss: 0.002712, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 20:28:47
Evaluating: accuracy: 0.8318, eval_loss: 0.5676, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 350000
lambda_1: -0.1420, lambda_2: 1970.2010 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000001000000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110000100000
00001101010110101011011010001101010100110010000000
00000000010010101000000010000100000001000000000000
Best eval score so far: 0.8315 @ step 246000 epoch 20.05
Saving the best model so far: [Epoch 28 | Step: 350000 | MACs sparsity: 0.5073 | Score: 0.8318 | Loss: 0.5676]
loss: 0.029457, lagrangian_loss: 0.002278, attention_score_distillation_loss: 0.000195
loss: 0.020111, lagrangian_loss: 0.004471, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 20:39:11
Evaluating: accuracy: 0.8259, eval_loss: 0.5876, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 352000
lambda_1: -0.1917, lambda_2: 1981.2241 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000100000000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110110101011010100000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010001101010101110000000000
00000000110010101000000010000000000000000100000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.031613, lagrangian_loss: 0.000800, attention_score_distillation_loss: 0.000195
loss: 0.026417, lagrangian_loss: 0.000408, attention_score_distillation_loss: 0.000191
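The step-350000 block above shows the best-checkpoint bookkeeping in action: the 0.8315 held since step 246000 is finally edged out by 0.8318, triggering a save. A sketch of the implied logic (function and argument names are mine):

# Sketch: evaluate every eval_steps, save only on improvement.
best_score = 0.8315

def maybe_save_best(score, step, epoch, macs_sparsity, eval_loss, save_fn):
    global best_score
    if score <= best_score:
        return
    best_score = score
    print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | "
          f"MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {eval_loss}]")
    save_fn()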
----------------------------------------------------------------------
time: 2023-07-20 20:49:22
Evaluating: accuracy: 0.8268, eval_loss: 0.5756, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 354000
lambda_1: -0.2646, lambda_2: 1992.4094 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.7 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000100000000000
11111111111111111111111110110000000100000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010001101010100110000100000
00000000010010101000000010000100000000010000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.031767, lagrangian_loss: 0.003157, attention_score_distillation_loss: 0.000189
loss: 0.044754, lagrangian_loss: 0.003941, attention_score_distillation_loss: 0.000196
ETA: 11:34:40 | Epoch 28 finished. Took 3755.34 seconds.
----------------------------------------------------------------------
time: 2023-07-20 20:59:37
Evaluating: accuracy: 0.8287, eval_loss: 0.5547, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 356000
lambda_1: -0.1327, lambda_2: 2003.5232 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.65 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000100000000000
11111111111111111111111110110001000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011111110010101011101110000000000
10001101010110101011011010001101010100110000000000
00000000010010101000000010000000010000010000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.032936, lagrangian_loss: 0.007283, attention_score_distillation_loss: 0.000191
loss: 0.025256, lagrangian_loss: 0.006489, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 21:09:55
Evaluating: accuracy: 0.8272, eval_loss: 0.5719, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 358000
lambda_1: -0.2443, lambda_2: 2014.7550 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001100000000000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
00000001010010101000000010000000000001000000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.055547, lagrangian_loss: 0.014522, attention_score_distillation_loss: 0.000197
loss: 0.030806, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-20 21:20:01
Evaluating: accuracy: 0.8292, eval_loss: 0.5726, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 360000
lambda_1: -0.1466, lambda_2: 2025.8934 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001100000000000000
11111111111111111111111110110000000000000000010000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110010000000
10001101010110101011011010001101010100110000000000
10000000010010101000000010000000010000000000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.024830, lagrangian_loss: 0.007154, attention_score_distillation_loss: 0.000197
loss: 0.031047, lagrangian_loss: 0.005049, attention_score_distillation_loss: 0.000191
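Note the systematic mismatch between "train remain" (fractional values like 0.65 or 0.71) and "infer remain" (always multiples of 0.02): the former is an expectation over stochastic gates, the latter a deterministic hard mask. A sketch assuming the standard hard-concrete gates this family of L0 methods uses (Louizos et al.; gamma/zeta below are the usual stretch constants, not values read from this run):

import torch

# Sketch: "train remain" ~ mean probability of a gate being open;
# "infer remain" ~ mean of the deterministic 0/1 mask at eval time.
gamma, zeta, temperature = -0.1, 1.1, 2.0 / 3.0

def train_remain(loga):
    limit = torch.log(torch.tensor(-gamma / zeta))
    return torch.sigmoid(loga - temperature * limit).mean()

def infer_remain(loga):
    s = torch.sigmoid(loga) * (zeta - gamma) + gamma
    return (s.clamp(0.0, 1.0) > 0.5).float().mean()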
----------------------------------------------------------------------
time: 2023-07-20 21:30:05
Evaluating: accuracy: 0.8296, eval_loss: 0.5604, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 362000
lambda_1: -0.2300, lambda_2: 2037.6338 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000010000000000
11111111111111111111111111110000000000000000000000
10111111111111111111111110110101011010000001000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110010000000
00001111010110101011011010001101010100110000000000
00000000110010101000000010001000000000000000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
loss: 0.044953, lagrangian_loss: 0.003727, attention_score_distillation_loss: 0.000196
loss: 0.029075, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-20 21:40:19
Evaluating: accuracy: 0.8335, eval_loss: 0.5589, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 364000
lambda_1: -0.2677, lambda_2: 2048.8987 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110000000010000000000000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
10000001010010101000000010000000000000000000000000
Best eval score so far: 0.8318 @ step 350000 epoch 28.52
Saving the best model so far: [Epoch 29 | Step: 364000 | MACs sparsity: 0.5073 | Score: 0.8335 | Loss: 0.5589]
loss: 0.029514, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000195
loss: 0.062538, lagrangian_loss: 0.003870, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-20 21:50:40
Evaluating: accuracy: 0.8311, eval_loss: 0.5611, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 366000
lambda_1: -0.1316, lambda_2: 2060.5735 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110100000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
10000000010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.045718, lagrangian_loss: 0.019253, attention_score_distillation_loss: 0.000189
loss: 0.039469, lagrangian_loss: 0.002182, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-20 22:00:55
Evaluating: accuracy: 0.8291, eval_loss: 0.5716, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 368000
lambda_1: -0.1560, lambda_2: 2071.9050 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110100000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110010000000
10001101010110101011011010001101010100110000000000
00000001010010101000000010000000010000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.027550, lagrangian_loss: 0.002986, attention_score_distillation_loss: 0.000195
ETA: 10:31:37 | Epoch 29 finished. Took 3806.28 seconds.
loss: 0.040342, lagrangian_loss: 0.004551, attention_score_distillation_loss: 0.000191
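With an evaluation every 2000 steps for 40 epochs, the accuracy plateau this stretch of the run sits on (roughly the 0.826-0.834 band) is easier to see by scraping the "Evaluating:" lines than by reading them. A small sketch:

import re

# Sketch: extract (step, accuracy) pairs from this log for plotting.
EVAL_RE = re.compile(r"Evaluating: accuracy: ([\d.]+),.*?step: (\d+)")

def accuracy_curve(log_text):
    return [(int(s), float(a)) for a, s in EVAL_RE.findall(log_text)]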
----------------------------------------------------------------------
time: 2023-07-20 22:11:03
Evaluating: accuracy: 0.8292, eval_loss: 0.5659, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 370000
lambda_1: -0.1703, lambda_2: 2083.6162 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000100000
11111111111111111111111110110000000100000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110001000000
10001101010110101011011010001101010100110000000000
00000000010010101000000010000100000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.035473, lagrangian_loss: 0.001788, attention_score_distillation_loss: 0.000195
loss: 0.023328, lagrangian_loss: 0.000506, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 22:21:15
Evaluating: accuracy: 0.8281, eval_loss: 0.585, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 372000
lambda_1: -0.0552, lambda_2: 2094.6367 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000010000000000
11111111111111111111111110110000001000000000000000
10111111111111111111111110110101011010001000000000
10111111111111111111111110111101011101110000000000
10011111110111111011111110010101011101110000000000
10001101010110101011011010001101010100110001000000
10000000010010101000000010000000000000000100000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.026920, lagrangian_loss: 0.009352, attention_score_distillation_loss: 0.000197
loss: 0.023108, lagrangian_loss: 0.006561, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 22:31:25
Evaluating: accuracy: 0.8308, eval_loss: 0.5593, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 374000
lambda_1: -0.2644, lambda_2: 2106.6602 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110010000000000000000000
10111111111111111111111110110101011010100000000000
10111111111111111111111110111101011101110000000000
10011111110111111011111110010101011101110000000000
00001101010110101011011010001101010101110000000000
10000000110010101000000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.018717, lagrangian_loss: 0.000310, attention_score_distillation_loss: 0.000190
loss: 0.023074, lagrangian_loss: 0.003757, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-20 22:41:37
Evaluating: accuracy: 0.828, eval_loss: 0.5762, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 376000
lambda_1: -0.1871, lambda_2: 2117.9185 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000010000
11111111111111111111111110110000000000000000010000
10111111111111111111111110110101011010000000010000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010001101010101110000000000
10000001010010101000000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.035999, lagrangian_loss: 0.000592, attention_score_distillation_loss: 0.000196
loss: 0.023555, lagrangian_loss: 0.000305, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 22:51:46
Evaluating: accuracy: 0.8302, eval_loss: 0.5713, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 378000
lambda_1: -0.2117, lambda_2: 2129.1008 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000000100
11111111111111111111111110110000100000000000000000
10111111111111111111111110110101011010010000000000
10111111111111111111111110111101011101110000000000
10011111110111111111011110010101011101110000000000
10001101010110101011011010001101010101110000000000
10000000010010101000000010000000010000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.031055, lagrangian_loss: 0.004859, attention_score_distillation_loss: 0.000191
loss: 0.027642, lagrangian_loss: 0.003668, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 23:01:59
Evaluating: accuracy: 0.8302, eval_loss: 0.5674, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 380000
lambda_1: -0.2256, lambda_2: 2139.9514 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000001000000
11111111111111111111111110110000000010000000000000
10111111111111111111111110110101111010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110010000000
00001101010110101011011010001101010101110000000000
10000000110010101000000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.340788, lagrangian_loss: 0.004377, attention_score_distillation_loss: 0.000188
ETA: 9:28:13 | Epoch 30 finished. Took 3742.15 seconds.
loss: 0.039616, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.000192
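The ETA lines are consistent with remaining epochs times the average epoch duration so far, not the most recent one: the 9:28:13 printed after epoch 30 is 34093 s, i.e. 9 remaining epochs at roughly 3788 s each, a bit more than 9 x the last epoch's 3742.15 s. A sketch of that arithmetic (function name is mine):

import datetime

# Sketch: ETA = remaining epochs x mean epoch time observed so far.
def eta(epoch_times, total_epochs=40):
    remaining = total_epochs - len(epoch_times)
    avg = sum(epoch_times) / len(epoch_times)
    return datetime.timedelta(seconds=round(remaining * avg))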
----------------------------------------------------------------------
time: 2023-07-20 23:12:13
Evaluating: accuracy: 0.8293, eval_loss: 0.5712, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 382000
lambda_1: -0.1231, lambda_2: 2151.1965 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110010000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110001000000
10001101010110101011011010001101010100110000000000
00000000010010101001000010000000000000010000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.021048, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000196
loss: 0.026272, lagrangian_loss: 0.000823, attention_score_distillation_loss: 0.000196
----------------------------------------------------------------------
time: 2023-07-20 23:22:26
Evaluating: accuracy: 0.8308, eval_loss: 0.5812, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 384000
lambda_1: -0.1895, lambda_2: 2162.2080 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111101001000000000000000
11111111111111111111111110110010000000000000000000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110001000000
00001101010110101011011010011101010100110000000000
10000000010010101000010010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.114770, lagrangian_loss: 0.001048, attention_score_distillation_loss: 0.000194
loss: 0.015411, lagrangian_loss: 0.001401, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 23:32:38
Evaluating: accuracy: 0.8294, eval_loss: 0.5598, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 386000
lambda_1: -0.1814, lambda_2: 2173.1755 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110010000000000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110001000000
00001101010110101011011010011101010100110000000000
00000000010010101000000010000100000000000100000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.025749, lagrangian_loss: 0.002628, attention_score_distillation_loss: 0.000196
loss: 0.023101, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-20 23:42:47
Evaluating: accuracy: 0.8281, eval_loss: 0.5801, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 388000
lambda_1: -0.1868, lambda_2: 2184.7368 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000100000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011001110010000000
10011111110111111011011110010101011101110001000000
10001101010110101011011010001101010100110000000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.014955, lagrangian_loss: 0.007844, attention_score_distillation_loss: 0.000195
loss: 0.029201, lagrangian_loss: 0.001751, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-20 23:52:54
Evaluating: accuracy: 0.8293, eval_loss: 0.5619, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 390000
lambda_1: -0.1527, lambda_2: 2195.9622 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000100000000000
11111111111111111111111110110000000000010000000000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111101011001110010000000
10011111110111111011011110010101011101110001000000
00001101010110101011011010011101010100110000000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.019106, lagrangian_loss: 0.015853, attention_score_distillation_loss: 0.000193
loss: 0.020821, lagrangian_loss: 0.003154, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-21 00:03:08
Evaluating: accuracy: 0.8319, eval_loss: 0.5868, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 392000
lambda_1: -0.1269, lambda_2: 2207.3472 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000001000
11111111111111111111111110110000000000010000000000
10111111111111111111111110110101011010001000000000
10111111111111111111111110111101011001110010000000
10011111110111111011011110010101011101110001000000
10001101010110101011011010001101010100110010000000
00000000010010101000000010000100000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.045349, lagrangian_loss: 0.000750, attention_score_distillation_loss: 0.000190
ETA: 8:24:54 | Epoch 31 finished. Took 3744.81 seconds.
loss: 0.039805, lagrangian_loss: 0.003190, attention_score_distillation_loss: 0.000189
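The attention_score_distillation_loss printed with every step sits almost flat around 0.0002, suggesting a small, long-converged auxiliary term. Its exact form is not shown in this excerpt; a heavily hedged sketch of one common formulation (matching student token-importance scores to teacher attention over the surviving tokens):

import torch
import torch.nn.functional as F

# Sketch only; the run's actual objective may differ. Mask out dropped
# tokens, renormalize both distributions, and penalize the mismatch.
def attention_score_distillation(student_scores, teacher_scores, keep_mask):
    s = student_scores.masked_fill(~keep_mask, float("-inf")).softmax(dim=-1)
    t = teacher_scores.masked_fill(~keep_mask, float("-inf")).softmax(dim=-1)
    return F.mse_loss(s, t)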
----------------------------------------------------------------------
time: 2023-07-21 00:13:23
Evaluating: accuracy: 0.8263, eval_loss: 0.5863, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 394000
lambda_1: -0.1749, lambda_2: 2219.1409 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000010000000
11111111111111111111111110110000100000000000000000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111101011011110000000000
10011111110111111011011110010101011101110010000000
00001101010110101011011010011101010100110000000000
00000000010010101001000010000000000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.026496, lagrangian_loss: 0.002293, attention_score_distillation_loss: 0.000190
loss: 0.041919, lagrangian_loss: 0.002875, attention_score_distillation_loss: 0.000195
----------------------------------------------------------------------
time: 2023-07-21 00:23:34
Evaluating: accuracy: 0.8319, eval_loss: 0.5673, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 396000
lambda_1: -0.1793, lambda_2: 2229.8420 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000010000000
11111111111111111111111110110001000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011011110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010011101010100110000000000
10000000010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.026051, lagrangian_loss: 0.003765, attention_score_distillation_loss: 0.000195
loss: 0.032978, lagrangian_loss: 0.011143, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-21 00:33:43
Evaluating: accuracy: 0.829, eval_loss: 0.5728, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 398000
lambda_1: -0.1832, lambda_2: 2241.4897 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110000000000000001000000
10111111111111111111111110110101011010000001000000
10111111111111111111111110111101011011110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010011101010100110000100000
10000000010010101010000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.016480, lagrangian_loss: 0.000833, attention_score_distillation_loss: 0.000193
loss: 0.019750, lagrangian_loss: 0.010164, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-21 00:43:52
Evaluating: accuracy: 0.8332, eval_loss: 0.562, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 400000
lambda_1: -0.1584, lambda_2: 2252.4983 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000100000
11111111111111111111111110110001000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
10001101010110101011011010011101010100110000000000
00000000010010101000000010000100000000010000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.013457, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.000191
loss: 0.026791, lagrangian_loss: 0.005832, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 00:53:56
Evaluating: accuracy: 0.8289, eval_loss: 0.5778, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4985, expected_sequence_sparsity: 0.8501, target_sparsity: 0.5, step: 402000
lambda_1: -0.1861, lambda_2: 2263.2148 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.44, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000000100000
11111111111111111111111110110000000000000000010000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010011101010101110000000000
10000000010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.032463, lagrangian_loss: 0.000041, attention_score_distillation_loss: 0.000193
loss: 0.022996, lagrangian_loss: 0.003565, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-21 01:04:09
Evaluating: accuracy: 0.8288, eval_loss: 0.5691, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 404000
lambda_1: -0.1901, lambda_2: 2274.3855 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100101000000000000000
11111111111111111111111110110000000000000000010000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111111011001110000000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010011101010100110000000000
00000001010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.031935, lagrangian_loss: 0.002314, attention_score_distillation_loss: 0.000197
ETA: 7:21:36 | Epoch 32 finished. Took 3734.87 seconds.
loss: 0.020903, lagrangian_loss: 0.000740, attention_score_distillation_loss: 0.000195
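Notice that "infer remain" entries only ever move in 0.02 steps (0.42 <-> 0.44, and later 0.6 <-> 0.58) and that macs_sparsity sits pinned at 0.5073 block after block while expected_sparsity drifts between 0.4985 and 0.4991: with 50 bins per layer, the deployable keep ratios are quantized to multiples of 1/50.

# Sketch of that quantization; function name is mine.
def quantized_keep_ratio(soft_ratio, bins=50):
    return round(soft_ratio * bins) / bins

print(quantized_keep_ratio(0.648))  # 0.64, one of the logged infer-remain values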
----------------------------------------------------------------------
time: 2023-07-21 01:14:15
Evaluating: accuracy: 0.828, eval_loss: 0.5609, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 406000
lambda_1: -0.1431, lambda_2: 2285.2622 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111100001000000000000000
11111111111111111111111110110000000000000010000000
10111111111111111111111110110101111010000000000000
10111111111111111111111110111101011001110000100000
10011111110111111011011110010101011101110001000000
00001101010110101011011010011101010100110000000000
00000000010010101001000010000100000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.018104, lagrangian_loss: 0.003344, attention_score_distillation_loss: 0.000195
loss: 0.025681, lagrangian_loss: 0.009526, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 01:24:27
Evaluating: accuracy: 0.8276, eval_loss: 0.5751, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 408000
lambda_1: -0.1784, lambda_2: 2297.0327 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000010000000000
11111111111111111111111110110100000000000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101011001110010000000
10011111110111111011011110011101011101110000000000
00001101010110101011011010011101010100110000000000
10000000010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.018792, lagrangian_loss: 0.025156, attention_score_distillation_loss: 0.000197
loss: 0.026418, lagrangian_loss: 0.003601, attention_score_distillation_loss: 0.000189
----------------------------------------------------------------------
time: 2023-07-21 01:34:36
Evaluating: accuracy: 0.8269, eval_loss: 0.5815, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 410000
lambda_1: -0.1600, lambda_2: 2308.2061 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000000100000000
11111111111111111111111110110000010000000000000000
10111111111111111111111110110111011010000000000000
10111111111111111111111110111101011001110000001000
10011111110111111011011110010101011101110100000000
00001101010110101011011010001101010101110000000000
10000001010010101000000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.024232, lagrangian_loss: 0.005429, attention_score_distillation_loss: 0.000197
loss: 0.015136, lagrangian_loss: 0.007900, attention_score_distillation_loss: 0.000197
----------------------------------------------------------------------
time: 2023-07-21 01:44:44
Evaluating: accuracy: 0.8311, eval_loss: 0.5691, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 412000
lambda_1: -0.1209, lambda_2: 2319.5144 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100101000000000000000
11111111111111111111111110110000000000000100000000
10111111111111111111111110110101011010000010000000
10111111111111111111111110111101011011110000000000
10011111110111111011011110010101011101110010000000
10001101010110101011011010001101010100110000000000
00000000010010101001000010000000000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.034978, lagrangian_loss: 0.008862, attention_score_distillation_loss: 0.000196
loss: 0.018479, lagrangian_loss: 0.000631, attention_score_distillation_loss: 0.000196
----------------------------------------------------------------------
time: 2023-07-21 01:54:58
Evaluating: accuracy: 0.8279, eval_loss: 0.564, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 414000
lambda_1: -0.0989, lambda_2: 2330.6313 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000001000000000
11111111111111111111111110110000000000000100000000
10111111111111111111111110110101011010000000000100
10111111111111111111111110111101011001110100000000
10011111110111111011011110010111011101110000000000
10001101010110101011011010001101010100110000000000
00000000010010101001000010000000000000000100000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.016491, lagrangian_loss: 0.006203, attention_score_distillation_loss: 0.000189
loss: 0.021359, lagrangian_loss: 0.001284, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 02:05:05
Evaluating: accuracy: 0.828, eval_loss: 0.5687, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 416000
lambda_1: -0.1382, lambda_2: 2341.4177 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.6 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110000000000000001000000
10111111111111111111111110110101011010100000000000
10111111111111111111111110111101011001110000100000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.025172, lagrangian_loss: 0.018751, attention_score_distillation_loss: 0.000197
loss: 0.026234, lagrangian_loss: 0.003148, attention_score_distillation_loss: 0.000197
ETA: 6:18:21 | Epoch 33 finished. Took 3727.64 seconds.
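Throughout this stretch the ten token_prune_loc flags never change: they line up with the ten prunable encoder layers of this run (layers 2-11 per the run configuration), and only the last seven ever drop tokens.

# Sketch of the mapping; list values are taken from the log.
prune_location = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
token_prune_loc = [False, False, False, True, True, True, True, True, True, True]
print([layer for layer, on in zip(prune_location, token_prune_loc) if on])
# [5, 6, 7, 8, 9, 10, 11]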
----------------------------------------------------------------------
time: 2023-07-21 02:15:11
Evaluating: accuracy: 0.8312, eval_loss: 0.5695, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 418000
lambda_1: -0.1523, lambda_2: 2352.8794 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110010000000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011001110100000000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
10000000010010101000000010000000010000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.165282, lagrangian_loss: 0.000794, attention_score_distillation_loss: 0.000191
loss: 0.011602, lagrangian_loss: 0.001234, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 02:25:15
Evaluating: accuracy: 0.8287, eval_loss: 0.5709, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 420000
lambda_1: -0.1413, lambda_2: 2363.9360 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001001000000000000
11111111111111111111111110110000001000000000000000
10111111111111111111111110110101011010000001000000
10111111111111111111111110111101011001110000100000
10011111110111111011011110011101011101110000000000
10001101010110101011011010001101010100110000000000
10000001010010101000000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.012007, lagrangian_loss: 0.009238, attention_score_distillation_loss: 0.000196
loss: 0.012510, lagrangian_loss: 0.000068, attention_score_distillation_loss: 0.000194
----------------------------------------------------------------------
time: 2023-07-21 02:35:31
Evaluating: accuracy: 0.8289, eval_loss: 0.5622, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 422000
lambda_1: -0.1941, lambda_2: 2375.4771 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001100000000000000
11111111111111111111111110110000000000000000010000
10111111111111111111111110110101011010100000000000
10111111111111111111111110111101011001110000100000
10011111110111111011111110010101011101110000000000
10001101010110101011011010001101010100110000000000
00000000110010101000000010001000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.034932, lagrangian_loss: 0.000152, attention_score_distillation_loss: 0.000194
loss: 0.016548, lagrangian_loss: 0.000773, attention_score_distillation_loss: 0.000192
----------------------------------------------------------------------
time: 2023-07-21 02:45:36
Evaluating: accuracy: 0.8288, eval_loss: 0.569, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4987, expected_sequence_sparsity: 0.8502, target_sparsity: 0.5, step: 424000
lambda_1: -0.1500, lambda_2: 2387.1641 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.6, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100011000000000000000
11111111111111111111111110110000000000000100000000
10111111111111111111111110110101011010100000000000
10111111111111111111111110111101011001110000100000
10011111110111111011111110010101011101110000000000
10001101010110101011011010001101010100110000000000
00000000010010101001000010000100000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.015092, lagrangian_loss: 0.004106, attention_score_distillation_loss: 0.000191
loss: 0.019305, lagrangian_loss: 0.003221, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 02:55:44
Evaluating: accuracy: 0.832, eval_loss: 0.5767, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 426000
lambda_1: -0.1141, lambda_2: 2398.0400 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111101001000000000000000
11111111111111111111111110111000000000000000000000
10111111111111111111111110110101011010000000010000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110000000000
10001101010110101011011010001101010100110000000000
00000000010010101001000010000000010000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.031008, lagrangian_loss: 0.000790, attention_score_distillation_loss: 0.000195
loss: 0.017847, lagrangian_loss: 0.002225, attention_score_distillation_loss: 0.000193
----------------------------------------------------------------------
time: 2023-07-21 03:05:54
Evaluating: accuracy: 0.8278, eval_loss: 0.5639, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 428000
lambda_1: -0.2020, lambda_2: 2409.8477 lambda_3: 0.0000
train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.59 0.43 0.14]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001000100000000000
11111111111111111111111110110000001000000000000000
10111111111111111111111110111101011010000000000000
10111111111111111111111110111101011101110000000000
10011111110111111011011110010101011101110000000000
10001101010110101011011010001101010100110000000000
10000000010010101001000010000000000000000000000000
Best eval score so far: 0.8335 @ step 364000 epoch 29.66
loss: 0.019273, lagrangian_loss: 0.000005, attention_score_distillation_loss: 0.000194
loss: 0.028445, lagrangian_loss: 0.023181, attention_score_distillation_loss: 0.000188
ETA: 5:15:10 | Epoch 34 finished. Took 3728.83 seconds.
----------------------------------------------------------------------
time: 2023-07-21 03:16:01
Evaluating: accuracy: 0.83, eval_loss: 0.5676, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 430000
lambda_1: -0.1498, lambda_2: 2421.2827 lambda_3: 0.0000
train remain: [1. 1. 1.
0.62 0.56 0.64 0.7 0.59 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110110000000000010000000000 10111111111111111111111110110101011010000010000000 10111111111111111111111110111101011011110000000000 10011111110111111011011110010101011101110000000000 10001101010110101011011010001101010100110000000000 10000000010010101001000010000000000000000000000000 Best eval score so far: 0.8335 @ step 364000 epoch 29.66 loss: 0.022899, lagrangian_loss: 0.000828, attention_score_distillation_loss: 0.000194 loss: 0.016534, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-21 03:26:09 Evaluating: accuracy: 0.83, eval_loss: 0.57, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 432000 lambda_1: -0.2015, lambda_2: 2432.6482 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.43 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000001000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101011011110000000000 10011111110111111011011110010101011101110000000000 10001101010110101011011010001101010100110000000000 10000000010010101000000010000100000000000000000000 Best eval score so far: 0.8335 @ step 364000 epoch 29.66 loss: 0.014346, lagrangian_loss: 0.004512, attention_score_distillation_loss: 0.000194 loss: 0.019248, lagrangian_loss: 0.001451, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 03:36:21 Evaluating: accuracy: 0.8343, eval_loss: 0.564, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 434000 lambda_1: -0.1198, lambda_2: 2444.1265 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110110000001000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101011011110000000000 10011111110111111011011110010101011101110000000000 10001101010110101011011010001101010100110000000000 10000000010010101000000010000000000001000000000000 Best eval score so far: 0.8335 @ step 364000 epoch 29.66 Saving the best model so far: [Epoch 35 | Step: 434000 | MACs sparsity: 0.5073 | Score: 0.8343 | Loss: 0.564] loss: 0.025968, lagrangian_loss: 0.003961, attention_score_distillation_loss: 0.000197 loss: 0.021390, lagrangian_loss: 0.004695, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 03:46:38 Evaluating: accuracy: 0.828, eval_loss: 0.5765, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 436000 lambda_1: -0.2750, lambda_2: 2455.4219 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.43 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110111000000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000000000 00001101010110101011011010001101010100110000000001 00000000010010101000000010000000000000000000000011 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.029678, lagrangian_loss: 0.012689, attention_score_distillation_loss: 0.000196 loss: 0.016226, lagrangian_loss: 0.038367, attention_score_distillation_loss: 0.000188 ---------------------------------------------------------------------- time: 2023-07-21 03:56:45 Evaluating: accuracy: 0.8292, eval_loss: 0.5741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 438000 lambda_1: -0.1154, lambda_2: 2467.0159 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011010000000100000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000000000 00001101010110101011011010001101010100110001000000 00000000110010101000000010000000010000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.013632, lagrangian_loss: 0.006622, attention_score_distillation_loss: 0.000195 loss: 0.040665, lagrangian_loss: 0.001807, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-21 04:06:59 Evaluating: accuracy: 0.8292, eval_loss: 0.5767, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 440000 lambda_1: -0.1236, lambda_2: 2478.3843 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110110000010000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000000000 00001101010110101011011010001101010101110000000000 00000001110010101000000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.023297, lagrangian_loss: 0.001198, attention_score_distillation_loss: 0.000194 loss: 0.019843, lagrangian_loss: 0.000205, attention_score_distillation_loss: 0.000192 ETA: 4:12:03 | Epoch 35 finished. Took 3739.82 seconds. ---------------------------------------------------------------------- time: 2023-07-21 04:17:08 Evaluating: accuracy: 0.8286, eval_loss: 0.5776, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 442000 lambda_1: -0.1172, lambda_2: 2489.4673 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000100000 11111111111111111111111110110001000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000000000 00001101010110101011011010011101010100110000000000 00000000010010101000000010000000010000010000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.014825, lagrangian_loss: 0.001972, attention_score_distillation_loss: 0.000192 loss: 0.024123, lagrangian_loss: 0.005700, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-21 04:27:17 Evaluating: accuracy: 0.828, eval_loss: 0.5772, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 444000 lambda_1: -0.1574, lambda_2: 2500.9089 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.43 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000100000 11111111111111111111111110110000000000000100000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011110010101011101110000000000 00001101010110101011011010011101010100110000000000 00000000110010101000000010000000000000010000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.017474, lagrangian_loss: 0.004079, attention_score_distillation_loss: 0.000192 loss: 0.030164, lagrangian_loss: 0.004371, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 04:37:20 Evaluating: accuracy: 0.8274, eval_loss: 0.5843, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 446000 lambda_1: -0.1525, lambda_2: 2512.0828 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100101000000000000000 11111111111111111111111110110000000000000000010000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010011101010100110000000000 10000000010010101001000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.039491, lagrangian_loss: 0.000780, attention_score_distillation_loss: 0.000191 loss: 0.029150, lagrangian_loss: 0.000391, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 04:47:29 Evaluating: accuracy: 0.8254, eval_loss: 0.5878, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 448000 lambda_1: -0.1192, lambda_2: 2523.4692 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000001000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101111001110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010100110001000000 00000000110010101001000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.031591, lagrangian_loss: 0.016773, attention_score_distillation_loss: 0.000190 loss: 0.015674, lagrangian_loss: 0.000346, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-21 04:57:40 Evaluating: accuracy: 0.8285, eval_loss: 0.5801, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 450000 lambda_1: -0.1260, lambda_2: 2535.4141 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100011000000000000000 11111111111111111111111110111000000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111101011001110000100000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010100110001000000 10000000010010101001000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.015532, lagrangian_loss: 0.004891, attention_score_distillation_loss: 0.000195 loss: 0.035260, lagrangian_loss: 0.000110, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-21 05:07:53 Evaluating: accuracy: 0.8293, eval_loss: 0.5756, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 452000 lambda_1: -0.0828, lambda_2: 2546.4072 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.43 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000000000001000000 10111111111111111111111110110101011010001000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010101110000000000 10000000110010101000000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.022517, lagrangian_loss: 0.000015, attention_score_distillation_loss: 0.000195 loss: 0.032551, lagrangian_loss: 0.005885, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-21 05:17:56 Evaluating: accuracy: 0.8263, eval_loss: 0.5756, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 454000 lambda_1: -0.1861, lambda_2: 2557.6995 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111101001000000000000000 11111111111111111111111110110000000000000000010000 10111111111111111111111110110101011010001000000000 10111111111111111111111110111101011001110000100000 10011111110111111011011010011101011101110000000000 00001101010110101011011010011101010100110000000000 00000000010010101000000010000100000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.019954, lagrangian_loss: 0.004547, attention_score_distillation_loss: 0.000196 ETA: 3:09:02 | Epoch 36 finished. Took 3776.11 seconds. loss: 0.019918, lagrangian_loss: 0.000678, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 05:28:05 Evaluating: accuracy: 0.8303, eval_loss: 0.5741, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 456000 lambda_1: -0.2442, lambda_2: 2568.6523 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.42 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000000001 11111111111111111111111110110000000000000000000100 10111111111111111111111110110101011010000000100000 10111111111111111111111110111101011011110000000000 10011111110111111011011010010101011101110010000000 00001101010110101011011010011101010100110000000000 10000000010010101000000010000000000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.011706, lagrangian_loss: 0.000038, attention_score_distillation_loss: 0.000194 loss: 0.023347, lagrangian_loss: 0.018485, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-21 05:38:14 Evaluating: accuracy: 0.8287, eval_loss: 0.5773, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 458000 lambda_1: -0.1296, lambda_2: 2579.7805 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111100001000000000000000 11111111111111111111111110110000000000000010000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101011001110100000000 10011111110111111011011010010101011101110001000000 00001101010110101011011010011101010100110000000000 00000000010010101000000010000100000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.010140, lagrangian_loss: 0.010016, attention_score_distillation_loss: 0.000185 loss: 0.015018, lagrangian_loss: 0.036993, attention_score_distillation_loss: 0.000185 ---------------------------------------------------------------------- time: 2023-07-21 05:48:19 Evaluating: accuracy: 0.8274, eval_loss: 0.5851, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 460000 lambda_1: -0.1160, lambda_2: 2591.1191 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.42 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100101000000000000000 11111111111111111111111110110000000000100000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101011001110100000000 10011111110111111011011010010101011111110000000000 00001101010110101011011010011101010100110000000000 00000000010010101000000010000100000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.021352, lagrangian_loss: 0.005505, attention_score_distillation_loss: 0.000195 loss: 0.019470, lagrangian_loss: 0.009287, attention_score_distillation_loss: 0.000193 ---------------------------------------------------------------------- time: 2023-07-21 05:58:20 Evaluating: accuracy: 0.8263, eval_loss: 0.5744, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 462000 lambda_1: -0.2001, lambda_2: 2601.9514 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.42 0.14] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001100000000000000 11111111111111111111111110110000001000000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101011001110000010000 10011111110111111011011010010101011101110001000000 00001101010110101011011010011101010100110000000000 00000000010010101001000010000000000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.013354, lagrangian_loss: 0.007458, attention_score_distillation_loss: 0.000195 loss: 0.019562, lagrangian_loss: 0.002764, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 06:08:22 Evaluating: accuracy: 0.8302, eval_loss: 0.5672, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 464000 lambda_1: -0.0485, lambda_2: 2613.2827 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000001000000 11111111111111111111111110110010000000000000000000 11111111111111111111111110110101011010000000000000 10111111111111111111111110111101011001110000010000 10011111110111111011011010010101011111110000000000 00001101010110101011011010011101010100110000000000 10000000010010101000000010000000000000010000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.024213, lagrangian_loss: 0.001418, attention_score_distillation_loss: 0.000195 loss: 0.025169, lagrangian_loss: 0.008550, attention_score_distillation_loss: 0.000196 ---------------------------------------------------------------------- time: 2023-07-21 06:18:30 Evaluating: accuracy: 0.8301, eval_loss: 0.5694, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 466000 lambda_1: -0.1369, lambda_2: 2625.0488 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000010000000000 11111111111111111111111110110000010000000000000000 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011010010101011101110000010000 10001101010110101011011010001101010100110000000000 10000000010010101000000010000000000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.016931, lagrangian_loss: 0.009058, attention_score_distillation_loss: 0.000195 ETA: 2:05:57 | Epoch 37 finished. Took 3710.45 seconds. loss: 0.014478, lagrangian_loss: 0.000500, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 06:28:38 Evaluating: accuracy: 0.8305, eval_loss: 0.5702, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 468000 lambda_1: -0.1416, lambda_2: 2636.5891 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111111110000000000000000000000 10111111111111111111111110110101011010000010000000 10111111111111111111111110111101111001110000000000 10011111110111111011011010010101011101110000001000 00001101010110101011011010001101010100110100000000 10000000110010101000000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.017361, lagrangian_loss: 0.001492, attention_score_distillation_loss: 0.000193 loss: 0.024293, lagrangian_loss: 0.013582, attention_score_distillation_loss: 0.000197 ---------------------------------------------------------------------- time: 2023-07-21 06:38:40 Evaluating: accuracy: 0.83, eval_loss: 0.5765, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 470000 lambda_1: -0.0463, lambda_2: 2647.9695 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011010000010000000 10111111111111111111111110111101011001111000000000 10111111110111111011011010010101011101110000000000 00001101010110101011011010001101010100110000100000 10000000010010101000000010000000000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.025260, lagrangian_loss: 0.005498, attention_score_distillation_loss: 0.000195 loss: 0.015104, lagrangian_loss: 0.003457, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 06:48:45 Evaluating: accuracy: 0.8292, eval_loss: 0.5829, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 472000 lambda_1: -0.1007, lambda_2: 2659.1738 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000010000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101011001110010000000 10011111110111111011011010010101011101110010000000 00001101010110101011011010001101010100110000100000 00000000010010101001000010000000000000000100000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.041425, lagrangian_loss: 0.007234, attention_score_distillation_loss: 0.000194 loss: 0.014260, lagrangian_loss: 0.019612, attention_score_distillation_loss: 0.000198 ---------------------------------------------------------------------- time: 2023-07-21 06:58:51 Evaluating: accuracy: 0.8291, eval_loss: 0.5843, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 474000 lambda_1: -0.0749, lambda_2: 2670.1899 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.7 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011010010000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011010010101011101110010000000 00001101010110101011011010001101010101110000000000 00000000010010101001000010000000010000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.020642, lagrangian_loss: 0.000191, attention_score_distillation_loss: 0.000196 loss: 0.034550, lagrangian_loss: 0.002893, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 07:08:58 Evaluating: accuracy: 0.8299, eval_loss: 0.5774, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 476000 lambda_1: -0.1137, lambda_2: 2681.5435 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.42 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111110110001000000000000000000 10111111111111111111111110110101011110000000000000 10111111111111111111111110111101011001110000100000 10011111110111111011011010010101011101110000100000 10001101010110101011011010001101010100110000000000 00000000110010101001000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.020795, lagrangian_loss: 0.001985, attention_score_distillation_loss: 0.000195 loss: 0.025447, lagrangian_loss: 0.000057, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-21 07:19:01 Evaluating: accuracy: 0.8315, eval_loss: 0.5783, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 478000 lambda_1: -0.1310, lambda_2: 2692.5757 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001010000000000000 11111111111111111111111110110000000000000000000010 10111111111111111111111110110111011010000000000000 10111111111111111111111110111101011001110010000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010011101010100110000000000 10000000110010101000000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.022536, lagrangian_loss: 0.005875, attention_score_distillation_loss: 0.000189 ETA: 1:02:56 | Epoch 38 finished. Took 3702.87 seconds. loss: 0.025166, lagrangian_loss: 0.010135, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 07:29:04 Evaluating: accuracy: 0.828, eval_loss: 0.5774, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 480000 lambda_1: -0.0549, lambda_2: 2703.4509 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.58 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001001000000000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101011010100000000000 10111111111111111111111110111111011001110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010101110000000000 00000001010010101000000010000000000000000100000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.031499, lagrangian_loss: 0.004756, attention_score_distillation_loss: 0.000190 loss: 0.023769, lagrangian_loss: 0.000304, attention_score_distillation_loss: 0.000194 ---------------------------------------------------------------------- time: 2023-07-21 07:39:07 Evaluating: accuracy: 0.8266, eval_loss: 0.5805, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 482000 lambda_1: -0.1090, lambda_2: 2715.0171 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111110110010000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111111011001110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010101110000000000 00000000010010101000000010000100010000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.030531, lagrangian_loss: 0.004530, attention_score_distillation_loss: 0.000196 loss: 0.012803, lagrangian_loss: 0.000058, attention_score_distillation_loss: 0.000191 ---------------------------------------------------------------------- time: 2023-07-21 07:49:09 Evaluating: accuracy: 0.8328, eval_loss: 0.5699, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 484000 lambda_1: -0.0588, lambda_2: 2725.9648 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.7 0.58 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000010000 11111111111111111111111110110000010000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111111011001110000000000 10011111110111111011011010011101011101110000000000 10001101010110101011011010001101010100110000000000 00000000010010101000000010000100000000010000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.024128, lagrangian_loss: 0.001533, attention_score_distillation_loss: 0.000193 loss: 0.032737, lagrangian_loss: 0.021743, attention_score_distillation_loss: 0.000192 ---------------------------------------------------------------------- time: 2023-07-21 07:59:15 Evaluating: accuracy: 0.8294, eval_loss: 0.5782, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 486000 lambda_1: -0.0785, lambda_2: 2736.4355 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.59 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000000000010000 11111111111111111111111110110100000000000000000000 10111111111111111111111110110101011011000000000000 10111111111111111111111110111101011101110000000000 10011111110111111011011010011101011101110000000000 00001101010110101011011010001101010101110000000000 00000000110010101000000010000000000001000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.049190, lagrangian_loss: 0.014222, attention_score_distillation_loss: 0.000192 loss: 0.017688, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 08:09:23 Evaluating: accuracy: 0.8306, eval_loss: 0.5752, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 488000 lambda_1: -0.0537, lambda_2: 2747.4675 lambda_3: 0.0000 train remain: [1. 1. 1. 0.62 0.56 0.64 0.71 0.59 0.41 0.15] infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111110111100001000100000000000 11111111111111111111111110110100000000000000000000 10111111111111111111111110111101011010000000000000 10111111111111111111111110111111011001110000000000 10011111110111111011011010011101011101110000000000 00000101010110101011011010011101010101110000000000 10000000010010101001000010000000000000000000000000 Best eval score so far: 0.8343 @ step 434000 epoch 35.37 loss: 0.025259, lagrangian_loss: 0.001469, attention_score_distillation_loss: 0.000196 loss: 0.018083, lagrangian_loss: 0.000076, attention_score_distillation_loss: 0.000195 ---------------------------------------------------------------------- time: 2023-07-21 08:19:29 Evaluating: accuracy: 0.8317, eval_loss: 0.5713, token_prune_loc: [False, False, False, True, True, True, True, True, True, True], macs_sparsity: 0.5073, expected_sparsity: 0.4991, expected_sequence_sparsity: 0.8503, target_sparsity: 0.5, step: 490000 lambda_1: -0.1115, lambda_2: 2759.0520 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.62 0.56 0.64 0.71 0.58 0.41 0.15]
infer remain: [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111110111100001001000000000000
11111111111111111111111110110100000000000000000000
10111111111111111111111110110101011010000001000000
10111111111111111111111110111101011001110010000000
10011111110111111011011010011101011101110000000000
00000101010110101011011010001101010101110001000000
10000000010010101000000010000000000001000000000000
Best eval score so far: 0.8343 @ step 434000 epoch 35.37
loss: 0.018418, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000192
ETA: 0:00:00 | Epoch 39 finished. Took 3703.98 seconds.
07/21/2023 08:25:32 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=0216ae10-29c5-42b4-b544-f846658f836d&run_id=0216ae10-29c5-42b4-b544-f846658f836d
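----------------------------------------------------------------------
Notes on reading the records above. Each evaluation record has a fixed shape: a dashed separator, a timestamped "Evaluating:" line carrying accuracy, eval_loss, sparsity figures and the step, the lambda values, the remain vectors, ten 50-bit token-keep masks, the running best score, and a few sampled training losses. A small parser is enough to pull the headline numbers out for plotting; the separator width and field names below are read off the log itself, nothing here comes from the training script:

import re

EVAL_RE = re.compile(r"Evaluating: accuracy: ([\d.]+), eval_loss: ([\d.]+)")
STEP_RE = re.compile(r"step: (\d+)")
MACS_RE = re.compile(r"macs_sparsity: ([\d.]+)")

def parse_eval_blocks(log_text):
    """Yield (step, accuracy, eval_loss, macs_sparsity) per record."""
    # Records are delimited by the long dashed separator lines.
    for block in re.split(r"-{30,}", log_text):
        acc, step, macs = EVAL_RE.search(block), STEP_RE.search(block), MACS_RE.search(block)
        if acc and step:
            yield (int(step.group(1)), float(acc.group(1)),
                   float(acc.group(2)),
                   float(macs.group(1)) if macs else None)

Fed the whole log, this yields (490000, 0.8317, 0.5713, 0.5073) for the last full record, which is handy for an accuracy-versus-step or accuracy-versus-sparsity curve.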
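Each of the ten mask rows belongs to one prune location, in the same order as token_prune_loc; the first three rows are all ones because their entries are False and those layers keep every token. A row has 50 token bins, and its fraction of 1s is exactly the matching "infer remain" entry. Checking the fourth row of the step-490000 record:

def keep_fraction(mask_row: str) -> float:
    # Fraction of token bins kept at one prune location:
    # this reproduces the corresponding "infer remain" entry.
    return mask_row.count("1") / len(mask_row)

row4 = "11111111111111111111111110111100001001000000000000"
assert len(row4) == 50 and row4.count("1") == 31
print(keep_fraction(row4))  # 0.62, the fourth "infer remain" value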
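"layerwise remain" behaves as a running product of "infer remain" down the depth of the network: a token dropped at one layer is gone for every later layer, so by the last prune location only about 1% of the original positions survive. Verifying against the step-490000 record (the two extra leading 1.0 entries presumably cover the stages before the first prune location):

import numpy as np

infer_remain = [1.0, 1.0, 1.0, 0.62, 0.56, 0.64, 0.7, 0.58, 0.42, 0.14]
# Running product: the fraction of the original sequence still alive
# after each layer, prefixed with 1.0s for the unpruned front of the net.
layerwise = [1.0, 1.0] + [float(x) for x in np.cumprod(infer_remain)]
print([round(x, 2) for x in layerwise])
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.62, 0.35, 0.22, 0.16, 0.09, 0.04, 0.01]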
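lambda_1 and lambda_2 are constraint multipliers rather than fixed hyperparameters: lambda_2 climbs monotonically from about 2341 to 2759 across these epochs while lagrangian_loss stays near zero, which is what an augmented-Lagrangian sparsity constraint looks like once expected_sparsity (0.4991) has settled onto target_sparsity (0.5). A sketch in the style of CoFi-like pruners; the training script's exact form is an assumption here:

import torch

def lagrangian_loss(expected_sparsity: torch.Tensor,
                    target_sparsity: float,
                    lambda_1: torch.Tensor,
                    lambda_2: torch.Tensor) -> torch.Tensor:
    # The multipliers are trained by gradient ascent on this same
    # objective, so they can keep growing, as in the log, while the
    # loss stays small because the constraint gap is already tiny.
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap.pow(2)

print(lagrangian_loss(torch.tensor(0.4991), 0.5,
                      torch.tensor(-0.1310), torch.tensor(2692.5757)))
# tensor(0.0023): the same order of magnitude as the logged values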
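"train remain" and "infer remain" disagree in the last decimal (0.59 vs 0.6, 0.41 vs 0.42, 0.15 vs 0.14) because training scores each bin with a relaxed stochastic gate while evaluation applies a hard keep/drop decision. Assuming hard-concrete gates in the sense of Louizos et al. (2018), the usual choice for this kind of L0 pruning, the soft per-bin keep probability would be:

import math

def prob_gate_open(log_alpha: float, beta: float = 2 / 3,
                   gamma: float = -0.1, zeta: float = 1.1) -> float:
    # P(gate > 0) for a hard-concrete gate; beta/gamma/zeta are the
    # customary defaults, not values read from this run. Averaging this
    # over a layer's 50 bins gives a soft "train remain", while inference
    # keeps a bin only when its deterministic gate is positive.
    return 1.0 / (1.0 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))

print(round(prob_gate_open(0.0), 2))  # 0.83: an undecided gate still mostly stays open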
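The "ETA:" lines scale with the number of epochs remaining: 3:09:02 after epoch 36, 2:05:57 after epoch 37, 1:02:56 after epoch 38, and 0:00:00 after the final epoch 39. They imply a per-epoch estimate of roughly 3780 s, higher than most of the logged epoch durations, so the exact smoothing the script uses is not recoverable from the log; a minimal reconstruction:

def format_eta(epochs_left: int, seconds_per_epoch: float) -> str:
    # "H:MM:SS" like the ETA lines; seconds_per_epoch stands in for
    # whatever running estimate the training loop keeps (assumed).
    total = int(epochs_left * seconds_per_epoch)
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

print(format_eta(1, 3776.11))  # -> 1:02:56

Seeding it with epoch 36's duration (3776.11 s, the slowest logged) happens to reproduce the post-epoch-38 line exactly.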
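Finally, checkpointing is best-only: "Best eval score so far: 0.8335 @ step 364000" is repeated until the step-434000 evaluation reaches 0.8343, which triggers the single "Saving the best model so far" line, after which every later record quotes 0.8343 @ step 434000. The bookkeeping amounts to (save_checkpoint is a stand-in for the script's own save routine):

best = {"score": 0.8335, "step": 364000}

def on_evaluate(score: float, step: int, save_checkpoint) -> None:
    # Persist a checkpoint only when the eval metric improves.
    if score > best["score"]:
        best.update(score=score, step=step)
        save_checkpoint(step)  # fires once above, at step 434000 (0.8343)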