VRAM Requirements?

#39
by dounykim - opened

I checked the documentation and found that running inference in 16-bit precision requires at least 320GB of memory and a multi-GPU system with a minimum of 4 x 80GB GPUs.
Does this mean I need 320 GB Disk Space and 320 GB VRAM?

Databricks org

Yes, to load all 132B params in 16-bit you need at least 264GB of RAM and somewhat more for headroom for inference, etc.
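The 264GB figure follows directly from the parameter count: 132B parameters at 2 bytes each. A minimal sketch of that arithmetic is below. The helper name `weight_memory_gb` is made up for illustration, and the 8-bit and 4-bit rows are only a weights-only estimate that assumes no quantization overhead; they are not measured numbers for any quantized DBRX build.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


DBRX_PARAMS = 132e9  # total parameter count quoted in the reply above

# 16-bit is the case discussed in this thread; 8 and 4 bits are
# hypothetical quantization levels shown for comparison only.
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(DBRX_PARAMS, bits):.0f} GB")
```

This covers the weights alone. Activations, the KV cache, and framework overhead come on top, which is why the documentation asks for more than 264GB.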

Is there any data on how quantized model versions work? Some benchmarks or just general opinions would be welcome :)

The model card states “The model requires ~264GB of RAM”. I have 440GB of RAM in a Databricks cluster, but the download halts at about 70% with an error message stating that no more disk space is available. What am I missing?

Databricks org

Disk space != RAM. That has nothing to do with RAM.

First, you need the RAM on one machine, not spread across the machines in a cluster. The model does not distribute that way. You want one big single-node 'cluster' in Databricks. But this is not the issue here.

I assume you're just letting Hugging Face download and cache the model. By default it saves copies of the model files in a location under ~/.cache. In Databricks, the root volume of machines is not large (100GB? IIRC?) because you generally don't directly use significant disk space in workloads. However, the attached autoscaling local storage is very large and grows as needed, since that is what Spark itself uses for local storage. That's the storage mounted under /local_disk0 (at least on AWS; I think it's the same on Azure). You can tell HF to cache under there by setting the environment variable HF_HUB_CACHE to such a path, in your notebook, before you import transformers.
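As a rough sketch of that setup, the helper below points HF_HUB_CACHE at a directory and reports how much free space is there. The function name `configure_hf_cache` and the `hf_cache` subdirectory are assumptions for illustration; only the HF_HUB_CACHE variable and the /local_disk0 mount come from the reply above.

```python
import os
import shutil


def configure_hf_cache(cache_dir: str) -> float:
    """Point the Hugging Face hub cache at cache_dir; return free space there in GB."""
    os.makedirs(cache_dir, exist_ok=True)
    # Set the variable before transformers / huggingface_hub is imported,
    # so the library picks up this location instead of ~/.cache.
    os.environ["HF_HUB_CACHE"] = cache_dir
    return shutil.disk_usage(cache_dir).free / 1e9


# In a notebook, run this in a cell before importing transformers, e.g.:
#   free_gb = configure_hf_cache("/local_disk0/hf_cache")  # hypothetical subdirectory
#   print(f"{free_gb:.0f} GB free for the model cache")
#   import transformers
```

If the reported free space is well below the size of the checkpoint, the download will likely fail partway through, as in the question above.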

That's still transient storage. If you want to avoid downloading from HF every time, you can instead use a path on persistent distributed storage as the cache. This would be DBFS or Unity Catalog volumes; /dbfs/... or /Volumes/... paths. Same idea: you set HF_HUB_CACHE to such a path. The upside is that the cache is persistent. The downside is that it's not locally attached storage, so the initial write to the cache is slower, and later loads are not as fast as from local disk because the files are copied from object storage. It may still be faster and more reliable than HF downloads, and it saves HF's servers some load!
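One way to combine the two options is to try a persistent location first and fall back to local disk. The sketch below is an illustration only: `pick_cache_dir` is a hypothetical helper, the candidate paths are whatever you pass in, and the free-space figure reported for object-storage-backed mounts may not be meaningful, so treat the check as a rough guard rather than a guarantee.

```python
import os
import shutil
from typing import Iterable, Optional


def pick_cache_dir(candidates: Iterable[str], min_free_gb: float) -> Optional[str]:
    """Return the first candidate that can be created and has enough free space."""
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            continue  # mount missing or not writable; try the next candidate
        if shutil.disk_usage(path).free / 1e9 >= min_free_gb:
            return path
    return None


# Usage idea: list a directory on persistent storage first, then one under
# /local_disk0, and set HF_HUB_CACHE to the result before importing transformers:
#   chosen = pick_cache_dir([persistent_dir, local_dir], min_free_gb=needed_gb)
#   if chosen is not None:
#       os.environ["HF_HUB_CACHE"] = chosen
```

Here `persistent_dir`, `local_dir`, and `needed_gb` are placeholders you would fill in for your own workspace and model size.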

Will there be an API available? Or is creating a Databricks workspace the most economical option for those lacking computing power?

Databricks org

You can experiment with the model here: https://huggingface.co/spaces/databricks/dbrx-instruct
For large-scale production use, you would want to run it yourself, or indeed use a third-party model hosting service, and Databricks is one of those.