Update readme with infinity
#17 opened by michaelfeil
This should be faster than SentenceTransformers, as it uses the nested tensor backend of PyTorch.
docker run --gpus all -p "7997":"7997" michaelf34/infinity:0.0.70 v2 --model-id Snowflake/snowflake-arctic-embed-m --dtype float16 --batch-size 32 --engine torch --port 7997
Status: Downloaded newer image for michaelf34/infinity:0.0.70
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-26 17:33:29,015 infinity_emb INFO: infinity_server.py:92
Creating 1engines:
engines=['Snowflake/snowflake-arctic-embed-m']
INFO 2024-11-26 17:33:29,026 infinity_emb INFO: select_model.py:64
model=`Snowflake/snowflake-arctic-embed-m` selected,
using engine=`torch` and device=`None`
INFO 2024-11-26 17:33:29,354 SentenceTransformer.py:216
sentence_transformers.SentenceTransformer
INFO: Load pretrained SentenceTransformer:
Snowflake/snowflake-arctic-embed-m
INFO 2024-11-26 17:33:40,276 SentenceTransformer.py:355
sentence_transformers.SentenceTransformer
INFO: 1 prompts are loaded, with the keys:
['query']
INFO 2024-11-26 17:33:40,293 infinity_emb INFO: Adding acceleration.py:56
optimizations via Huggingface optimum.
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO 2024-11-26 17:33:40,716 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=32 and avg tokens per
sentence=1
2.66 ms tokenization
5.58 ms inference
0.12 ms post-processing
8.36 ms total
embeddings/sec: 3828.62
INFO 2024-11-26 17:33:41,058 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=32 and avg tokens per
sentence=512
14.66 ms tokenization
122.97 ms inference
0.14 ms post-processing
137.78 ms total
embeddings/sec: 232.26
INFO 2024-11-26 17:33:41,060 infinity_emb INFO: model select_model.py:104
warmed up, between 232.26-3828.62 embeddings/sec at
batch_size=32
INFO 2024-11-26 17:33:41,062 infinity_emb INFO: batch_handler.py:443
creating batching engine
INFO 2024-11-26 17:33:41,063 infinity_emb INFO: ready batch_handler.py:512
to batch requests.
INFO 2024-11-26 17:33:41,066 infinity_emb INFO: infinity_server.py:106
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.70
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
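As a sanity check on the warm-up numbers in the log above, the reported embeddings/sec should be roughly the batch size divided by the total time per batch. The sketch below recomputes it from the rounded millisecond values printed in the log; the function name is only illustrative, and small differences from the logged 3828.62 and 232.26 are expected because the log presumably divides by the unrounded timings.

```python
# Timings copied from the warm-up log above (milliseconds per batch).
BATCH_SIZE = 32
TOTAL_MS_SHORT = 8.36    # avg tokens per sentence = 1
TOTAL_MS_LONG = 137.78   # avg tokens per sentence = 512


def embeddings_per_second(batch_size: int, total_ms: float) -> float:
    """Throughput implied by one batch taking `total_ms` milliseconds."""
    return batch_size / (total_ms / 1000.0)


short_rate = embeddings_per_second(BATCH_SIZE, TOTAL_MS_SHORT)
long_rate = embeddings_per_second(BATCH_SIZE, TOTAL_MS_LONG)

# Share of the long-sequence batch time spent in model inference (122.97 ms).
inference_share = 122.97 / TOTAL_MS_LONG

print(f"short inputs: {short_rate:.1f} embeddings/sec")
print(f"long inputs:  {long_rate:.1f} embeddings/sec")
print(f"inference share at 512 tokens: {inference_share:.0%}")
```

At 512 tokens per sentence, inference accounts for most of the batch time, so that is where a faster backend would have the most effect.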
spacemanidol changed pull request status to merged
Thanks!
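For anyone trying the container from the command above, here is a minimal client sketch using only the Python standard library. It assumes the server exposes an OpenAI-style `POST /embeddings` route that accepts `model` and `input` and returns the vectors under `data[i].embedding`. That route and response shape are assumptions not shown in the log, so check the Swagger UI at `/docs` before relying on them. The helper names `build_embedding_request` and `embed` are hypothetical, and `localhost` stands in for the `0.0.0.0` bind address.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"                # port taken from the docker command
MODEL_ID = "Snowflake/snowflake-arctic-embed-m"   # model id taken from the docker command


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a JSON POST request for a list of texts.

    Assumption: the server accepts an OpenAI-style body on /embeddings.
    """
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts):
    """Send the request and return one embedding (list of floats) per text.

    Assumption: the response JSON looks like {"data": [{"embedding": [...]}, ...]}.
    """
    with urllib.request.urlopen(build_embedding_request(texts)) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

With the server running, `embed(["what is snowflake arctic embed?"])` would return a list containing a single vector if the assumptions above hold.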