I’m trying to build a somewhat complicated Docker space and I believe I’m in the home stretch, but I’m running into a bit of a snag with the longevity of the space.
The app (currently set to private until this gets fixed, source code below) launches several internal port listeners and uses an nginx proxy to listen on the standard port 7860, with supervisord running the nginx and server processes.
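For context, the process layout is roughly the following sketch (illustrative only - the program names and internal ports here are assumptions, not the actual config from the repo):

```ini
; supervisord.conf sketch (program names and ports are illustrative, not the real config)
[supervisord]
nodaemon=true                                ; keep supervisord in the foreground as the main process

[program:server]
command=uvicorn app:app --host 127.0.0.1 --port 8000   ; an internal app listener (assumed)
autorestart=true

[program:nginx]
command=nginx -g "daemon off;"               ; nginx listens on 7860 and proxies to the internal ports
autorestart=true
```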
After the initial space build, it opens fine in the browser, and everything loads and executes successfully end-to-end. Invariably, however, a SIGTERM seemingly comes out of nowhere and ends the main process (supervisord). This happens somewhere between 3 and 5 minutes after startup, every time - see the attached screenshot of one of several attempts at various reconfigurations. To be clear, I am not setting a five-minute sleep timer - I was trying 30-60 minute timers, but even with no sleep timer at all, it still receives this TERM in under 5 minutes.
The space will then show a “Preparing Space” message in the browser, where it seems to sit endlessly until the actual sleep timer hits or I pause it.
To add to the strangeness, after this, no amount of restarts or factory rebuilds ever gets it to respond again. The only way to get another 3-5 minute shot at testing is to create an entirely new space.
The source code for the app is available here: GitHub - painebenjamin/anachrovox: Real-time Audio AI Chat with a Retro Vibe. I’m pulling from the built package repository there (i.e. my Dockerfile is just FROM ghcr.io/painebenjamin/anachrovox:latest). I’ve tried several of my own machines with varying hardware configurations and haven’t had the TERM happen on any of them, so I think it’s likely originating from the Spaces backend, but I could be wrong. My hunch is that the networking setup isn’t accounting for some kind of heartbeat signal, so the Spaces backend doesn’t realize the app is awake? Again, just a hunch.
I’d very much appreciate any insights into what might be going on here, or how I could get more information. I’ve been able to grab my app’s and nginx’s logs, and neither shows anything out of the ordinary, so unfortunately I don’t think additional logging from within the container will get me any closer to a fix at this point.
If the app uses any of the network-related libraries that HF prohibits (I’m not really familiar with the details…), the Space can become unusable within a few minutes, but it’s not clear whether that’s what’s happening here.
@not-lain do you know anything about this kind of symptom?
Hey @John6666! I was hoping you might show up in this thread.
For a bit more information, I did encounter the “permanently building” issue multiple times, which I found information about in this thread (also featuring you: Space is Building... permanently). If I make a barebones proof-of-concept app featuring nginx, supervisord, and uvicorn, it refuses to build. I made this example space public: Nginx Supervisord - a Hugging Face Space by benjamin-paine. To be clear, while this seems related to the initial issue, it may not be.
If using nginx as a proxy for internal connections is disallowed, I suppose I could understand, but HF themselves recommend using it in their documentation (I can’t add any more links - it’s at huggingface DOT co/docs/hub/en/spaces-sdks-docker), and it’s one of the easier tools to set up for adapting apps that run on multiple ports to a single-port space. It’d be a huge shame if we were forced into using some esoteric-but-unflagged Python package for proxying (or, even worse, rolling our own) because of collateral damage from fighting bad actors.
If you want to expose apps served on multiple ports to the outside world, a workaround is to use a reverse proxy like Nginx to dispatch requests from the broader internet (on a single port) to different internal ports.
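For anyone landing here later, the pattern the docs describe looks roughly like this (a sketch only - the ports and locations are assumptions, not the real Anachrovox config):

```nginx
# Listen on the single port Spaces exposes (7860) and dispatch to internal services.
server {
    listen 7860;

    location / {
        proxy_pass http://127.0.0.1:8000;        # main app on an assumed internal port
    }

    location /ws/ {
        proxy_pass http://127.0.0.1:8001;        # e.g. a websocket/audio service (assumed)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;  # required for websocket upgrades
        proxy_set_header Connection "upgrade";
    }
}
```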
With a site that has this many server resources, a lot of malicious companies come flocking…
I’ve seen some serious scams mentioned on Discord. I think countermeasures themselves are unavoidable.
However, it’s not reflected in the documentation, and it’s simply inconvenient for us…
I’ll try reporting it on HF Discord.
A long-ish update I’m copying from the Discord conversation you and I had so there’s visibility here:
The thing that’s stopping the build isn’t nginx - it’s supervisord. A barebones test installing nginx and serving local files does NOT result in the build errors.
I re-tooled the application to use a simple shell script for process monitoring instead of supervisord, and additionally enabled dev mode on the space so I could do some more debugging. The issue still occurs in dev mode. I was able to determine that the TERM is indeed going to the parent process, and since the parent process in dev mode is a wrapper that runs my actual process, I get a different log message confirming it is being terminated.
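For reference, the shell-script monitor is roughly this shape (a minimal sketch with assumed commands and ports, not the actual entrypoint from the repo):

```sh
#!/bin/sh
# Minimal process-monitor sketch (commands and ports are assumptions).
uvicorn app:app --host 127.0.0.1 --port 8000 &
APP_PID=$!
nginx -g "daemon off;" &
NGINX_PID=$!

# Restart either child if it dies, checking every few seconds.
while true; do
    kill -0 "$APP_PID" 2>/dev/null || { uvicorn app:app --host 127.0.0.1 --port 8000 & APP_PID=$!; }
    kill -0 "$NGINX_PID" 2>/dev/null || { nginx -g "daemon off;" & NGINX_PID=$!; }
    sleep 5
done
```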
At the moment, I’ve managed to dance around the overall issue by being resilient to the restart - basically, assume the main entry point will be prematurely terminated and act accordingly. Here is a log of what that looks like:
So I’m no longer blocked - yay! The initial symptom persists though, so for any future adventurer who randomly encounters this, this solution may work for you too.
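For anyone wanting to copy the workaround, “resilient to the restart” can be sketched roughly like this (illustrative shell only - the real entrypoint, marker file, and commands differ): make startup work idempotent and handle the TERM cleanly so the next launch picks up where the last one left off.

```sh
#!/bin/sh
# Sketch of a TERM-tolerant entrypoint (paths, commands, and the marker file are assumptions).

# Only do expensive one-time setup if it hasn't been done already, so a
# premature restart doesn't repeat it (model downloads, cache warm-up, etc.).
if [ ! -f /data/.initialized ]; then
    echo "Running one-time setup..."
    # ... download models, prepare caches ...
    touch /data/.initialized
fi

# Forward TERM to the children and exit cleanly instead of dying mid-work.
trap 'echo "Caught SIGTERM, shutting down"; kill "$APP_PID" "$NGINX_PID" 2>/dev/null; exit 0' TERM

uvicorn app:app --host 127.0.0.1 --port 8000 &
APP_PID=$!
nginx -g "daemon off;" &
NGINX_PID=$!

# Block until a child exits or the TERM trap fires.
wait
```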