Reason for high performance may be an error in evaluation
Look at the scores: MATH for Qwen2.5-72B-Instruct is suspiciously low. Your model may outperform it simply because Qwen2.5-72B-Instruct was somehow misevaluated. I have opened a discussion here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/975
@ChuckMcSneed I 100% agree that the instruct model was evaluated incorrectly, but I still think my model would outperform it. It would be cool to see it re-evaluated.
but I still think my model would outperform it
That depends on the re-evaluated score, but if Qwen got the same MATH score as this model, then in terms of averages Qwen would beat it by about 0.5%.
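To make the averaging argument concrete, here is a rough sketch of how swapping a single benchmark score shifts a model's average. Every number below is a hypothetical placeholder, not a real leaderboard result, and the six benchmark names are my assumption about what the leaderboard averages; `average_with_replacement` is just an illustrative helper.

```python
from statistics import mean


def average_with_replacement(scores, benchmark, new_value):
    """Return the mean of `scores` after replacing one benchmark's value.

    The input dict is left unchanged.
    """
    updated = dict(scores)
    updated[benchmark] = new_value
    return mean(updated.values())


# HYPOTHETICAL placeholder scores, for illustration only (not real results).
instruct_scores = {
    "IFEval": 80.0, "BBH": 60.0, "MATH": 2.0,
    "GPQA": 15.0, "MuSR": 12.0, "MMLU-PRO": 50.0,
}
finetune_scores = {
    "IFEval": 78.0, "BBH": 60.0, "MATH": 38.0,
    "GPQA": 15.0, "MuSR": 12.0, "MMLU-PRO": 49.0,
}

# What the instruct model's average would be if its MATH score
# matched the fine-tune's MATH score.
corrected = average_with_replacement(
    instruct_scores, "MATH", finetune_scores["MATH"]
)

print(f"instruct average as reported:  {mean(instruct_scores.values()):.2f}")
print(f"instruct average if corrected: {corrected:.2f}")
print(f"fine-tune average:             {mean(finetune_scores.values()):.2f}")
print(f"gap (instruct - fine-tune):    {corrected - mean(finetune_scores.values()):+.2f}")
```

With real numbers plugged in, the sign of the last line shows which model ends up ahead once the suspicious MATH score is replaced.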
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/975
According to @alozowski, Qwen Instruct got worse at in-context learning and tried to highlight the answers instead of following the format. Your method brings it closer to the base model, which can be both good (better at in-context learning, less forgetting) and bad (less obedient).
What is your fine-tuning dataset?