Regarding Concerns about MMLU Scores

#5
by deleted - opened
deleted

There's a tight correlation between MMLU (diverse knowledge) and parameter count because you need somewhere to store the additional information, which in the case of LLMs, requires increasing the parameter count.

For example, this is why LLM families like Llama 2 (7b, 13b, 34b & 70b) see a smooth and predictable MMLU gradient with parameter count that's unseen in nearly all other tests, such as WinoGrande.

Anyways, it's easy to get a feel for the true MMLU score of any LLM by asking a series of fringe esoteric questions, and Yi-34 dense is no more knowledgeable than Mixtral 48 sparse (~70 MMLU for both). Yi-34's 77 MMLU score is without a doubt due to test contamination in their foundational model. And after testing this LLM it got no more fringe esoteric questions right, so it too has an MMLU of ~70.

If you were going to deliberately cheat, you obviously wouldn't add it all to MMLU, so something odd happened to boost the MMLU from 70 to 85.6. Perhaps since Yi-34 is already highly contaminated with MMLU data (a 7-point boost), you did something that brought out even more of said contamination.

Regardless, the only way for a modern dense LLM (e.g. Mistral or Yi) to achieve an MMLU score of 85 is to have ~270 billion parameters (5 point MMLU increase for every doubling of parameters). It's simply not theoretically possible to store that much information in a current generation 34b dense LLM.
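The arithmetic behind the ~270 billion figure can be sketched as follows, taking the 5-points-per-doubling rule of thumb from this comment at face value. It is the poster's heuristic rather than an established scaling law, and the function name `params_needed` is only illustrative.

```python
def params_needed(base_params_b, base_mmlu, target_mmlu, points_per_doubling=5.0):
    """Extrapolate the parameter count (in billions) needed to reach target_mmlu.

    Assumes the rule of thumb stated in this thread: each doubling of
    parameters adds a fixed number of MMLU points.
    """
    doublings = (target_mmlu - base_mmlu) / points_per_doubling
    return base_params_b * 2 ** doublings


# 34b dense model with an assumed "true" MMLU of 70 and a target of 85:
# 15 points / 5 points per doubling = 3 doublings, so 34 * 8.
needed = params_needed(34, 70, 85)
print(f"{needed:.0f}b parameters")
```

Three doublings of 34b give 272b, which is where the "~270 billion" estimate comes from. The result is only as good as the assumed slope and the assumed starting score of 70.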

You are making groundless accusations. Especially when you compare Yi-34 dense to Mixtral: I think the former has a larger active parameter count. At least at this point, I hope you won't make inappropriate remarks about other models here.

But I need to emphasize that the MMLU score does not mean anything special, nor is it strictly a function of parameter count; it is more data-oriented, which is consistent with our approach. In an updated version, we used more than 50M pieces of data that were crawled from the web and comprehensively rewritten using multiple web pages and large-context LLMs, of which more than 10M pieces were the output of GPT-4 level models.

Of course, the model in this repo is not the final version. However, the changes in MMLU scores can be reflected quickly - this indirectly proves the theory that LLM is a compression model. By simulating the compression results of larger models on a wider range of data, we can obtain simulation improvements in compression capabilities.

JosephusCheung changed discussion status to closed

Contamination detection on MMLU is also provided in the repo, and the result is at a safe level. The only potential concern arises from not actively filtering test-set content out of the web crawler data.
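For readers unfamiliar with what "actively filtering the content of the test set" might involve, here is a minimal sketch of one common approach: dropping crawled documents that share a long word n-gram with any test question. This is not the method used in the repo's contamination test; the function names and the choice of 8-word n-grams are assumptions made for illustration.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(document, test_questions, n=8):
    """True if the document shares any word n-gram with any test question."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(q, n) for q in test_questions)


def filter_crawl(documents, test_questions, n=8):
    """Keep only documents with no n-gram overlap with the test set."""
    return [d for d in documents if not is_contaminated(d, test_questions, n)]


# Hypothetical data, made up for this example.
question = "which planet in the solar system has the most known moons"
verbatim = "Quiz: Which planet in the solar system has the most known moons? Answer: Saturn."
paraphrase = "Saturn is orbited by more moons than any other planet."

kept = filter_crawl([verbatim, paraphrase], [question])
print(kept)
```

Exact n-gram matching only catches near-verbatim copies. The paraphrased document passes the filter even though it carries the same fact, so reworded or translated test content would not be detected this way.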

In fact, you are not the first person to make this statement, but it makes me lose confidence in releasing the whole dataset (I have released a 1M+ subset) and the new models. I spent more on this than on any synthetic dataset you can see here; it is dozens of times larger than the openhermes dataset, and what I got was just subjective conjecture.

deleted

@JosephusCheung Contamination testing is very unreliable, which is partly why people all but stopped flagging models on HF. And with a dataset as large as OH (~1M), and you claiming a much larger dataset, there's no way to avoid notable contamination. Plus, part of why testing is unreliable is that contamination still works even when the data is reworded and stored in a different language, in this case most likely Chinese. This is almost certainly why Yi-34 has an MMLU score of 77, yet a true score of around 70.

I'm honestly not making accusations of deliberate contamination, but I am saying with 100% certainty that the true broad knowledge of this LLM (MMLU score) is around 70 and nowhere near 85.

And if it were, people would be dying to figure out what you did. Of the 1000s of fine-tunes, not just of Yi-34, but all other models, yours is one of only a few with a notable MMLU bump.

Finally, I read a lot of papers on this, including information provided by Meta, MistralAI and other makers of foundational models. And MMLU is not something you can ever notably change with fine-tuning. It's locked to the parameter count of a given technology. For example, by using far more sophisticated data filtering, Mistral was able to increase MMLU by ~5 points at a given parameter count over Llama 2's internet dump approach. And Microsoft, by using nothing but low-redundancy, textbook-quality data, plus avoiding data like pop culture that's largely excluded from the MMLU, was able to add another ~5 points to MMLU at a given parameter count in Phi-2. This is currently the limit.

Take a look at the test questions. Nearly all don't require a high IQ or advanced language skills to get right. They primarily just require extensive knowledge of various fields of study. And a 5 point gain on MMLU requires about twice the world knowledge, hence the required doubling of parameter count. Again, Yi-34 has a true MMLU score of 70 (their score of 77 is due to contamination). And I tested this LLM and it has no more knowledge than any other Yi, so it too has an MMLU score of around 70. Something other than an increase in knowledge/data is accounting for the score of 85. And like I said, I suspect you stumbled on a way to bring out the contamination that's already in Yi (likely buried in non-English languages like Chinese, making traditional contamination testing all but useless).

Please consult this experimental model: https://huggingface.co/itsliupeng/llama2_7b_mmlu

The MMLU score of Llama 2 7B went from 46.87 to 60.04, on such a small dataset.
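To put the quoted figures side by side with the 5-points-per-doubling heuristic raised earlier in this thread, the gain can be converted into an "equivalent" number of parameter doublings. This is a sketch that simply combines the two claims; the heuristic is an assumption from the discussion, not a measured law.

```python
before, after = 46.87, 60.04   # Llama 2 7B MMLU figures quoted above
gain = after - before          # points gained by the fine-tune

# Assumed heuristic from the thread: 5 MMLU points per doubling of parameters.
points_per_doubling = 5.0
doublings = gain / points_per_doubling
equivalent_params_b = 7 * 2 ** doublings  # size a 7b base would "need" under the heuristic

print(f"gain: {gain:.2f} points")
print(f"equivalent doublings: {doublings:.2f}")
print(f"equivalent size: {equivalent_params_b:.0f}b")
```

Under that heuristic, a gain of roughly 13 points from fine-tuning alone corresponds to well over two doublings, which is the tension the two sides of this discussion are arguing about.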

As I mentioned in other answers, this is largely a matter of cumulative effects and dataset bias. Although I did not use such extreme reverse-retrieval data and instead used web crawlers of my own, I do think that unintentional contamination may exist, but I think it is within a reasonable range; see the contamination test.

As for your other remarks, for example, you claim that Yi's score is much higher than it actually is, I hope you will not make such irresponsible remarks, and I also do not want you to criticize other people's work here. In fact, these are all conjectures based on your subjective experience, and I think the performance of the model should be consistent on new test data that is isomorphic to MMLU but has different content. This was also confirmed in my communication with Yi developers.

And I think it is difficult to get the same results through direct cheating in training without damaging the general performance of the model. Even if you train directly on MMLU answers, you may not be able to get such an MMLU score, let alone with the model still working correctly on other tasks. However, I still do not believe the increase in MMLU score actually translates to improved performance on downstream tasks; I believe we are still far from OpenAI GPT-4, and any victories on narrow subdomains are not very meaningful. As such, I'm hesitant to tout this high MMLU score, as I don't see any tangible benefit it brings; I would rather present non-subjective evidence on contamination to avoid unfounded accusations and debates.

deleted

@JosephusCheung I have more than my fair share of blind spots, but what I do know is this model, including Yi, didn't perform any better on esoteric knowledge than other models like Mixtral that scored 70 on MMLU. And since even superhuman IQ and language skills can't bring the score up to 85 on a test like MMLU, which measures expert knowledge across various domains, there's simply no way to notably increase MMLU with traditional fine-tuning. Perhaps there's a technique that can access the data better (e.g. Laser), but none of those techniques so far made any real difference.

MMLU is currently baked into foundational models. You simply ain't going to boost it with fine-tuning, at least not with any fine-tuning techniques currently available.

And it's not that Yi cheated to gain 7 points. It takes a careful data preening effort across all languages to remove contamination from the corpus, and they clearly didn't do a good job.

  1. Public backlash due to failure to proactively filter potential contamination.

If you truly want this resolved you can just be fully transparent so this can be determined independently. If you are worried that it's contaminated you should probably have said so from the beginning, and on the model card.

These are just suggestions. I'm not your boss.

"If you truly want this resolved you can just be fully transparent so this can be determined independently. If you are worried that it's contaminated you should probably have said so from the beginning, and on the model card."

If you have ever attempted to create a synthetic dataset, you would understand the challenges involved. Verbatim repetition is obviously impossible, but rewrites caused by web text and content generated by GPT-series models during synthesis are beyond my control. Detecting such issues is no less difficult than synthesizing the data itself. You are asking a bit much from me.
Therefore, I have provided the model's own contamination detection result, which I believe is safe.

In other words, I think the level of contamination - similar to other common pre-training datasets - is acceptable.

If your model has a SOTA result -- particularly by such a leap as yours -- you are going to get scrutiny! It's a good thing.

I don't think everyone cares only about fame - If I were the author of MetaMathQA, considering the impact on GSM8K, I would never have released that data.

"rewrites caused by web text and content generated by GPT-series models during synthesis are beyond my control. Detecting such issues is no less difficult than synthesizing the data itself. You are asking a bit much from me."

Did you log the content that was fed into the synthesis pipeline?

@JosephusCheung i've given a few LLMs a shot and right now, CausalLM-34B is the one that fits my needs best. APP users aren't too concerned about MMLU scores, they just want to know if a model performs well for them personally. And for now, CausalLM-34B has been quite effective in my experience.

Frankly i truly hope you keep refining this model, because as more people use it in real applications, an ecosystem will naturally form around it, which is essential.

Also, i'd advise against making training datasets easily accessible. You know, APP users can fine-tune with LoRA or similar methods to accomplish tasks without needing open data sets, maybe most requests for data are just driven by competition...

I don't think anyone needs my take on this discussion, but this is the internet, so whatever i guess.

I think it's important to take all forms of skepticism and concern seriously when it comes to fine tuning LLMs, due to the enormous amount of training data involved.
And I don't think it is reasonable to respond to that criticism with aggression and toxicity.

And I think it's important to remember that all that criticism just comes from the fact that we don't want people to waste their time and money on LLMs that cheat benchmarks but perform worse in real use. @JosephusCheung and @ehartford might just end up wasting money and resources without responding to these concerns, and so will many others deceived by these benchmarks if we don't take user feedback as seriously as, if not more seriously than, benchmarks. And as a Free and Open Source enthusiast, I think releasing more training data, or the exact data generation pipeline and the source data it takes, is the best way to respond, to allow open research and to let us present our findings, and maybe potential solutions.

PS: pinging @Phil337 to make sure he knows that JosephusCheung didn't show him the full dataset, see the comments above. That still means MMLU test questions could have ended up in the final dataset.

deleted

@nlpguy I'm with you. The best path forward is for all methods and training data to be made public.

The counterargument is very strong. It's unfair to spend a large amount of time, effort and money curating a quality set of synthetic data, and then to have others immediately use it, depriving you of the uniqueness all that cost and effort brought.

But when things are done behind closed doors contamination inevitably creeps in, because without a large number of eyes scanning the data it's all but impossible to ensure clean data, as JosephusCheung pointed out earlier.

Perhaps a delayed release of all methods and data within a month or two would be a good compromise.

And frankly, I keep making things worse because of my harsh personality, and even had to delete about a couple dozen comments after crossing the line. This is part of the reason why others like ehartford are quick to anger with me. It's not just because they object to constructive criticism and feedback.

After a discussion with a companion today, I arrived at a simple solution: we just need to find a way to lower the MMLU score to a level that no one will object to, for example, by forcing output in CoT format for all responses instead of directly giving answers from time to time, thereby reducing the false positive accuracy.

We are all sinners, Happy Easter.

JosephusCheung changed discussion status to closed
JosephusCheung unpinned discussion

Come on now! Why end the discussion in such bad faith :(

You told Phil that you understand that he is not acting out of malice, and I'm not even sure you read his response.

"Perhaps a delayed release of all methods and data within a month or two would be a good compromise."

Doesn't that sound reasonable, at least the idea if not the timeframe?

JosephusCheung locked this discussion
JosephusCheung pinned discussion
JosephusCheung unlocked this discussion
