Sparsity in Mixtral
#137
Opened by dpk17
What are the sparse weights in Mixtral? I looked at the intermediate layer, whose weight matrices have shape [14336, 4096], and counted the non-zero entries in the weights used in that layer's forward pass with torch.count_nonzero(x). Every entry in the matrix was non-zero. Which weights in the model are actually sparse?
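The check described above can be sketched roughly as follows. This is only an illustration: `zero_fraction` is a hypothetical helper name, and the small tensors are stand-ins for a [14336, 4096] weight matrix, not actual Mixtral weights.

```python
import torch


def zero_fraction(weight: torch.Tensor) -> float:
    """Return the fraction of entries in `weight` that are exactly zero."""
    total = weight.numel()
    nonzero = torch.count_nonzero(weight).item()
    return 1.0 - nonzero / total


# Stand-in tensors (assumption: illustrative values only, not real model weights).
dense = torch.ones(8, 4)   # every entry is non-zero
pruned = dense.clone()
pruned[:, :2] = 0.0        # zero out two of the four columns

print(zero_fraction(dense))   # fraction of zeros in the fully dense matrix
print(zero_fraction(pruned))  # fraction of zeros after zeroing half the columns
```

A matrix with no zero-valued entries, like the one I observed, would give a zero fraction of 0 under this measure.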