Hanging while building the engine for `unetxl.trt8.6.1.plan`

#7
by kousun12 - opened

Things seem to start well, but the build keeps getting stuck on the engine for `unetxl.trt8.6.1.plan`. I've left it running for about an hour.

[I] Initializing TensorRT accelerated StableDiffusionXL txt2img pipeline
[I] Load tokenizer pytorch model from: pytorch_model/xl-1.0/XL_BASE/tokenizer
[I] Load tokenizer pytorch model from: pytorch_model/xl-1.0/XL_BASE/tokenizer_2
[I] Load VAE decoder pytorch model from: pytorch_model/xl-1.0/XL_BASE/vae
Building TensorRT engine for /stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base/clip.opt/model.onnx: engine/clip.trt8.6.1.plan
[W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Updating network outputs to ['text_embeddings', 'hidden_states']
[I]     Configuring with profiles: [Profile().add('input_ids', min=(1, 77), opt=(1, 77), max=(1, 77))]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 40370.00 MiB, TACTIC_DRAM: 40370.00 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[W] Check verbose logs for the list of affected weights.
[W] - 112 weights are affected by this issue: Detected subnormal FP16 values.
[W] - 1 weights are affected by this issue: Detected finite FP32 values which would overflow in FP16 and converted them to the closest finite FP16 value.
[I] Finished engine building in 49.294 seconds
[I] Saving engine to engine/clip.trt8.6.1.plan
Building TensorRT engine for /stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base/clip2.opt/model.onnx: engine/clip2.trt8.6.1.plan
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1517189726
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1517189726
[W] onnx2trt_utils.cpp:70: TensorRT is using FLOAT32 precision to run an INT32 ArgMax / ArgMin. Rounding errors may occur for large integer values
[W] Tensor DataType is determined at build time for tensors not marked as input or output.
Updating network outputs to ['text_embeddings', 'hidden_states']
[I]     Configuring with profiles: [Profile().add('input_ids', min=(1, 77), opt=(1, 77), max=(1, 77))]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 40370.00 MiB, TACTIC_DRAM: 40370.00 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[W] - 324 weights are affected by this issue: Detected subnormal FP16 values.
[I] Finished engine building in 169.366 seconds
[I] Saving engine to engine/clip2.trt8.6.1.plan
Building TensorRT engine for /stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base/unetxl.opt/model.onnx: engine/unetxl.trt8.6.1.plan
[W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 128, 128), opt=(2, 4, 128, 128), max=(2, 4, 128, 128)).add('encoder_hidden_states', min=(2, 77, 2048), opt=(2, 77, 2048), max=(2, 77, 2048)).add('text_embeds', min=(2, 1280), opt=(2, 1280), max=(2, 1280)).add('time_ids', min=(2, 6), opt=(2, 6), max=(2, 6)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 40370.00 MiB, TACTIC_DRAM: 40370.00 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]

Running on a machine with an A100:

root@modal:~# nvidia-smi
Thu Sep  7 21:25:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0    58W / 400W |    819MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

And PyTorch 2.1.0, using NVIDIA's 23.07 container image with Python 3:

=============
== PyTorch ==
=============

NVIDIA Release 23.07 (build 63867923)
PyTorch Version 2.1.0a0+b5021ba

I've tried both the 8.6 and 9.0 branches of https://github.com/rajeevsrao/TensorRT, with the same result. Any idea what may be going wrong?
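For anyone hitting the same hang: one way to narrow it down is to rebuild just the UNet engine with `trtexec` (which ships in NVIDIA's TensorRT and PyTorch containers) so the verbose log shows which layer's tactic search it stalls on. This is a diagnostic sketch, not part of the repo's build script; the ONNX path, FP16 flag, workspace size, and input shapes below are taken from the log above, and the log file name is just an example.

```shell
# Diagnostic sketch: rebuild only the UNet engine with verbose logging so the
# stall point (typically tactic timing for a specific layer) becomes visible.
# Shapes are copied from the optimization profile in the build log above.
trtexec \
  --onnx=/stable-diffusion-xl-1.0-tensorrt/sdxl-1.0-base/unetxl.opt/model.onnx \
  --saveEngine=engine/unetxl.trt8.6.1.plan \
  --fp16 \
  --memPoolSize=workspace:40370 \
  --shapes=sample:2x4x128x128,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6,timestep:1 \
  --verbose 2>&1 | tee unetxl_build.log
```

If the verbose log stops advancing at a particular layer, that layer's tactic timing is the bottleneck; SDXL's UNet is large, so a first-time build on a fresh machine can legitimately take a long while, and passing `--timingCacheFile=unetxl.cache` makes subsequent builds much faster.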

kousun12 changed discussion status to closed
