Great talk. I think the TL;DR is:
- Don't use rate limits.
- Capacity *planning* is hard. Make it dynamic.
- Dynamic capacity planning can be done with AIMD: en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease
- It's OK to tell your clients you're overloaded; they are the ones who are obliged to respect back pressure.
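For anyone unfamiliar, AIMD is the same scheme TCP uses for congestion control, applied here to a dynamic concurrency limit. A minimal sketch in Python (the variable names, factors, and overload signal are my illustrative assumptions, not from the talk):

```python
# AIMD: grow the limit slowly while things succeed,
# cut it sharply on any overload signal.
def aimd_update(limit, overloaded, add=1, mult=0.5, floor=1):
    """Return the next concurrency limit.

    limit      -- current allowed in-flight requests
    overloaded -- True if we just saw a rejection/timeout
    add        -- additive increase per healthy interval (assumed value)
    mult       -- multiplicative decrease factor (assumed value)
    """
    if overloaded:
        return max(floor, int(limit * mult))  # back off fast
    return limit + add                        # probe for more capacity slowly

limit = 10
limit = aimd_update(limit, overloaded=False)  # -> 11
limit = aimd_update(limit, overloaded=True)   # -> 5
```

The asymmetry is the point: capacity is probed for gently but surrendered immediately, so the system oscillates just under the overload point instead of camping on it.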
You're still rate limiting; you're just pushing that job onto the clients. But how will they know how long to wait after receiving a NOPE before trying again? If they retry too quickly, they'll just keep getting NOPEd. If they retry too slowly, there may be stranded capacity again.

Instead, what if we simply sent back a "wait this long before your next request" header with every response? The wait period could be zero if the server is below capacity, but if it's at capacity we calculate a conservative estimate in milliseconds for how long they should wait (it may be different every time) before making the next request. Simply compare how many requests we completed last second to how many we complete this second, assume demand will be the same next second, and divide the next second up fairly among all the clients. Clients we've never seen before get priority, since they have no known wait time to go by, and we should give them VIP treatment anyway over the clients who have been hammering us for a while with no sign of stopping. Clients who disrespect the wait time get deprioritized, or even NOPEd.

I feel like this would maintain a constant near-100% pressure, and yet clients also know exactly what to expect: if they respect the wait time, they're guaranteed a quick response and no NOPE. If they see a wait value that's too high, they can choose to write the server off as too congested and give up for now. That just leaves more capacity for the rest of the clients. The same happens when you give a client a wait time but they have nothing more to send: some capacity goes unused, and you can account for that in the next second's measurement.
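The per-client wait calculation described above could be sketched like this (the function name, the fairness policy, and the "full second when we have no data" fallback are my assumptions):

```python
def next_wait_ms(completed_last_second, active_clients, seen_before):
    """Conservative per-client wait hint, per the scheme above.

    Assume next second's capacity equals last second's throughput and
    divide it fairly: each client gets one request per fair slot.
    """
    if not seen_before:
        return 0  # new clients get VIP treatment: no wait
    if completed_last_second == 0:
        return 1000  # no throughput data yet; ask for a full second (assumed fallback)
    # milliseconds between requests = 1000 / (fair slots per client per second)
    return int(1000 * active_clients / completed_last_second)

next_wait_ms(200, 50, seen_before=True)   # 4 slots/client/second -> 250 ms
next_wait_ms(10, 40, seen_before=True)    # more clients than slots -> 4000 ms
next_wait_ms(100, 10, seen_before=False)  # new client -> 0 ms
```

Note the hint degrades gracefully: when demand outstrips capacity the wait simply grows past one second, which is exactly the "too congested, give up for now" signal described above.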
It sounds helpful for the server-to-proxy and proxy-to-client hops to use a back-channel signal to advise a "back off", and then to respond by dropping/refusing requests if the advice is not followed.
One thing I'd like to see is how AIMD is configured. How do you decide the backoff factor in the multiplicative decrease and the additive factor in the additive increase?
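There's no single right answer, but TCP's classic choice (add 1 per healthy interval, halve on congestion) is the usual starting point; the Wikipedia article linked above gives the general form. One way to reason about the trade-off is recovery time (a sketch; the parameter values are illustrative, not recommendations):

```python
def recovery_steps(limit, add=1, mult=0.5):
    """Increase steps needed to regain the capacity lost by one decrease.

    A larger `mult` (gentler backoff) or larger `add` recovers faster,
    at the cost of oscillating closer to the overload point.
    """
    lost = limit - int(limit * mult)
    return -(-lost // add)  # ceiling division

recovery_steps(100)            # halved to 50, +1/step -> 50 steps
recovery_steps(100, add=5)     # -> 10 steps
recovery_steps(100, mult=0.9)  # gentler cut, lose only 10 -> 10 steps
```

In practice the decrease factor is sized by how badly a brief overload hurts you (a shared database might warrant 0.5; a stateless service might tolerate 0.9), and the additive step by how quickly you want to reclaim capacity after a transient dip.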
Instead of using `number of concurrent users`, shouldn't we be using the `time taken to serve a request` as the deciding factor for increasing/decreasing incoming requests?
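Latency is indeed a common control signal for this: delay-based congestion control (e.g. TCP Vegas) works the same way. A hedged sketch of driving the AIMD adjustment from serve time instead of an explicit rejection signal (the target threshold and factors are assumptions):

```python
def latency_aimd(limit, observed_ms, target_ms=100, add=1, mult=0.8, floor=1):
    """Adjust a concurrency limit from serve time rather than rejections.

    If requests take longer than target_ms, treat it as congestion and
    back off multiplicatively; otherwise probe upward additively.
    """
    if observed_ms > target_ms:
        return max(floor, int(limit * mult))
    return limit + add

limit = 20
limit = latency_aimd(limit, observed_ms=40)   # fast responses -> 21
limit = latency_aimd(limit, observed_ms=250)  # queueing detected -> 16
```

The appeal is that latency rises *before* the server starts refusing work, so the limiter can react ahead of hard failures; the catch is picking `target_ms`, since normal latency variance can trigger spurious backoff.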