Models that used Nectar dataset

#749
by Stark2008 - opened

According to this, the Nectar dataset is contaminated.

The following models are made of base models that used that dataset:
StarMonarch-7B - one of the base models used Nectar.
StrangeMerges_45-7B-dare_ties - one of the base models used Nectar.

CarbonBeagle-11B > NeuralBeagle-11B > franken-Beagle-11B > NeuralBeagle14-7B > distilabeled-Marcoro14-7B-slerp > Marcoro14-7B-slerp > Mistral-7B-Merge-14-v0.1 > Starling-LM-7B-alpha (fine-tuned on Nectar)

Open LLM Leaderboard org

Hi @Stark2008 ,

Thanks for pointing this out about the Nectar dataset contamination!

Wasn't there a cleaned version of Nectar released after the contamination was discovered? That might help address some of the concerns here. Plus, I think it could be really helpful to hear from the authors of the models you mentioned. They might have more insights on this.

About the chain of models you mentioned – this looks like a sequence where each model builds on the previous one. To keep it simple, we could start by reaching out to the author of the final model in this chain, Starling-LM-7B-alpha.

I'll tag the authors of the models to get their thoughts:

What do you all think about the usage of the Nectar dataset and whether a cleaned version was used?

Thanks a lot!

I'm not one that really cares. If it's contaminated, feel free to flag it.

In short, contamination or not, the flood of Mistrals is overfitting the tests, making their average leaderboard scores meaningless (off by up to 10 points). Singling out a small percentage of them for contamination is a waste of time.

Isn't the overfitting of the tests by those models the result of contamination? How can you tell for certain which models' score is legit if you don't flag the ones that are contaminated?

@Stark2008 I'm a non-technical user who heard others use the term overfitting so I'm parroting it.

One possible example of overfitting vs contamination is making an LLM stubborn. Doing so makes the LLM perform worse overall, such as stubbornly sticking to absurd hallucinations, but at the same time limits getting tricked into sharing falsehoods like the earth is flat, boosting TruthfulQA. Sure enough ALL Mistrals with very high TruthfulQAs are stubborn asses.

But since you're right that contamination is also happening and should be addressed I'm hiding my original comment because it's off topic.

"Wasn't there a cleaned version of Nectar released after the contamination was discovered?"

@alozowski
I tried looking into it, couldn't find anything suggesting that. What makes you think there was?

P.S.
Another chain of models using berkeley-nest/Starling-LM-7B-alpha:
Kukedlc/NeuralMaths-Experiment-7b > mlabonne/NeuralDaredevil-7B > mlabonne/Daredevil-7B > EmbeddedLLM/Mistral-7B-Merge-14-v0.2 > EmbeddedLLM/Mistral-7B-Merge-14-v0 > berkeley-nest/Starling-LM-7B-alpha

Obviously the process of finding lots of them can be easily automated... Let me know if you want me to keep posting them...
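As a rough sketch of what that automation could look like, the snippet below walks a model-to-base-model map until it reaches a flagged model. The `PARENTS` map and the `lineage_to` helper are hypothetical names made up for illustration; the entries are just the chain quoted above, and building such a map from real model cards is assumed, not shown.

```python
from collections import deque

# Hypothetical parent map: model -> base models it was merged or fine-tuned from.
# Filled in by hand from the chain quoted above; a real run would have to
# extract this from model cards, which this sketch does not attempt.
PARENTS = {
    "Kukedlc/NeuralMaths-Experiment-7b": ["mlabonne/NeuralDaredevil-7B"],
    "mlabonne/NeuralDaredevil-7B": ["mlabonne/Daredevil-7B"],
    "mlabonne/Daredevil-7B": ["EmbeddedLLM/Mistral-7B-Merge-14-v0.2"],
    "EmbeddedLLM/Mistral-7B-Merge-14-v0.2": ["EmbeddedLLM/Mistral-7B-Merge-14-v0"],
    "EmbeddedLLM/Mistral-7B-Merge-14-v0": ["berkeley-nest/Starling-LM-7B-alpha"],
}


def lineage_to(model, flagged, parents):
    """Return the shortest chain from `model` to a flagged ancestor, or None."""
    queue = deque([[model]])
    seen = {model}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in flagged:
            return path
        # Breadth-first search over base models; `seen` guards against cycles.
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


if __name__ == "__main__":
    flagged = {"berkeley-nest/Starling-LM-7B-alpha"}
    chain = lineage_to("Kukedlc/NeuralMaths-Experiment-7b", flagged, PARENTS)
    print(" > ".join(chain) if chain else "no flagged ancestor found")
```

Running the same check over every model on the leaderboard would list each one with a flagged ancestor, together with the chain that justifies the flag.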

Thank you for raising the concerns. Nectar was created mainly for improving human preferences with RLHF, where we do not SFT the model to imitate the responses but rather let the model explore better responses by itself. As a result, we are mainly looking at the human preference evaluation on Chatbot Arena https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard. These are usually fresher prompts and should better reflect how the models perform in a real chat environment.

That being said, it is very likely that some prompts in Nectar coincide with some of the test prompts here. If someone has the bandwidth to create a cleaned-up dataset, I'd be very happy to see it.
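For anyone who wants to try, a minimal sketch of such a clean-up might simply drop training prompts that match a benchmark prompt after normalization. Everything here is an assumption for illustration: the function names are made up, the toy prompts are invented, and exact matching after normalization is the simplest possible check, which will miss paraphrased prompts.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def split_contaminated(train_prompts, benchmark_prompts):
    """Split training prompts into (clean, flagged) by normalized exact match."""
    bench = {normalize(p) for p in benchmark_prompts}
    clean, flagged = [], []
    for prompt in train_prompts:
        if normalize(prompt) in bench:
            flagged.append(prompt)
        else:
            clean.append(prompt)
    return clean, flagged


if __name__ == "__main__":
    # Invented toy data, not taken from Nectar or from any benchmark.
    benchmark = ["Is the earth flat?"]
    train = ["is the Earth flat ?", "Write a haiku about autumn."]
    clean, flagged = split_contaminated(train, benchmark)
    print("clean:", clean)
    print("flagged:", flagged)
```

A stricter pass could compare n-gram overlap instead of whole prompts, at the cost of more false positives.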

TBH, given the current landscape, I won't be surprised if most of the SFT datasets / SFT models on HF have already been contaminated to some extent with existing static test prompts on the Open LLM Leaderboard. That's why I personally prefer looking at vibe tests / more dynamic questions like those on Chatbot Arena.

Open LLM Leaderboard org

Hi @banghua , thanks for your comment!
Just to be sure, you say in the dataset card of Nectar that you are using UltraFeedback, and UltraFeedback uses prompts from TruthfulQA - do you use the gold values at any point, or only the prompt to generate preference data? If the latter, I think it's a problem of accidental contamination (if some of the preference data includes the correct answer), and it would be great to have a cleaned version of the dataset.

I agree that it's likely a lot of the SFT datasets accidentally contain static test set prompts, but we're still trying to make the open llm leaderboard a fair resource by judging all models in the same setup, which means flagging such contamination when it occurs and can change the scores.

Regarding vibe checks/crowdsourced human evaluations, there have been a number of super interesting studies at ICLR this year showing that they correlate well with human preference, but not with actual model quality: humans tend to prefer models which are assertive even when they say bullshit, which align with their personal values, and which exhibit sycophantic behaviors.
So it's a good measure if you want to build a friendly chatbot, not a good one if you are actually concerned with anything like factuality.

( @Stark2008 re cleaned Nectar, I was discussing it with Alina, and was sure one was already created from a previous discussion, but I'm not finding it ^^ - I must have hallucinated it)

Hey @clefourrier ,

Thanks for the response.
So as far as we can tell the only Nectar existing today is contaminated?

clefourrier changed discussion status to closed

Hey @clefourrier ,

Why did you close it?

Open LLM Leaderboard org

Hi! I had answered your question with a thumbsup, and it's then been inactive for 3 weeks

"Hi! I had answered your question with a thumbsup, and it's then been inactive for 3 weeks"

The original intention of the discussion was about flagging models that used Nectar, which I don't feel like we concluded, but okay...

Open LLM Leaderboard org

The first leaderboard is now closed so this conversation is not as relevant, since we changed all the benchmarks (also to avoid contamination issues) - we'll just need to add the flags for the 3 models above on the archived version, which I will do today.