Is the tokenization and mean pooling method described in the documentation here performing "late chunking" as it is described in your paper/library "late-chunking"?

#108
by imambujang - opened

Looking at the manual mean pooling method described in your documentation for this model and the method for late-chunking in your repo here: https://github.com/jina-ai/late-chunking/tree/main, does the mean pooling method described in this model's documentation basically achieve late chunking? They look similar and I am not 100% sure on the differences between both approaches.

If they do accomplish the same thing, then, in essence, we can more simply use our own chunks with late chunking by tokenizing our chunks directly with the mean pooling method here, instead of having to manually derive the span_annotations for our custom chunks (as used in late-chunking), right? Thank you!

The mean pooling described in the model card here produces one embedding for the whole output. So you need to adjust it to apply mean pooling on subsets of the output embeddings that correspond to your chunking. Those subsets are usually defined by the span_annotations. The important point is that you apply the model to all tokens at once and then chunk just before the mean pooling. Does this answer your question?
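To illustrate the adjustment: below is a minimal sketch of pooling over span subsets rather than the whole output. It assumes `token_embeddings` is the full-context token-level output of the model for a single sequence (e.g. `last_hidden_state[0]`), and that `span_annotations` are `(start, end)` token-index pairs as in the late-chunking repo; the function name and the toy data are illustrative, not from the library.

```python
import numpy as np

def pool_by_spans(token_embeddings, span_annotations):
    """Mean-pool subsets of the token embeddings, one subset per chunk span."""
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in span_annotations]

# Toy stand-in for the model output: 6 token embeddings of dimension 4.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(6, 4))

# Two chunks: tokens 0-2 and tokens 3-5.
span_annotations = [(0, 3), (3, 6)]
chunk_embeddings = pool_by_spans(token_embeddings, span_annotations)
```

Each entry of `chunk_embeddings` is then one chunk embedding, but every token embedding was computed with attention over the full input, which is what distinguishes this from encoding each chunk separately.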

Gotcha, so the main difference is that late chunking involves splitting a full text into smaller chunks after the transformer model processes the entire text, but just before mean pooling, which ensures that the chunk embeddings capture the full contextual information of the text. In contrast, the code on this model card processes each individual sentence separately and then applies mean pooling to each sentence independently, without the context of the full text.
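That difference can be made concrete with a toy "contextual encoder" (purely illustrative, not the actual model): each token's embedding mixes in a summary of the whole sequence, so encoding sentences separately yields different chunk embeddings than encoding the concatenated text once and pooling per-sentence spans.

```python
import numpy as np

def encode(token_ids):
    """Toy contextual encoder: each token embedding is [token_id, sequence_mean],
    so the output depends on the surrounding context."""
    token_ids = np.asarray(token_ids, dtype=float)
    context = token_ids.mean()
    return np.stack([np.array([t, context]) for t in token_ids])

sent_a, sent_b = [1, 2, 3], [10, 20, 30]

# Naive chunking: encode each sentence separately, then mean-pool each one.
naive = [encode(s).mean(axis=0) for s in (sent_a, sent_b)]

# Late chunking: encode the concatenated text once, then pool per-sentence spans.
full = encode(sent_a + sent_b)
late = [full[0:3].mean(axis=0), full[3:6].mean(axis=0)]
```

Here `naive[0]` sees only sentence A's context, while `late[0]` covers the same token span but its context component reflects the full text, so the two chunk embeddings differ.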
