How-To Build a REST Implementation that Scales Better than SMPP

Published March 25, 2016 by Paul Cook

There is a myth in the wholesale SMS industry that you can only achieve scale with SMPP. Here are a few tips on how to achieve scale with the Nexmo REST API as an alternative to SMPP. By applying these principles, it’s possible to build a REST integration that rivals a well-constructed SMPP integration.

  • Persistence: Use HTTP/1.1 rather than HTTP/1.0. Make use of connection keep-alive to avoid the overhead of socket establishment round trips on each request.
  • Parallelism: Use an executor/worker-thread style design pattern to ensure that multiple requests are happening at once. One of the biggest scaling pitfalls of making REST requests is running your dispatching in a single thread and keeping everything serial. Individual HTTP requests carry a degree of latency, both from the underlying TCP round-trip time to get the data to the service and retrieve the response, and from the nature of HTTP as a serial question-and-response protocol. Using multiple concurrent requests will help absorb this latency. Think about TCP sliding windows, with their notion of many packets in flight waiting to be acknowledged. If the TCP stack in our machines did not do this and instead sent each packet one by one, waiting for a response each time, the whole internet would immediately grind to a halt.
  • Throttle: It can seem counter-intuitive that in order to go fast, it is necessary to throttle, but consider this: if you blast away transmitting data as fast as you possibly can, you will quickly exceed either the capacity of the other end to respond and keep up, or exhaust the capacity of the pipe in between to move all of this data and keep up. When this happens, this will have one of a number of effects.

    Packets will be dropped and the TCP stack will need to retransmit them. This causes a delay while you wait for the retransmission timeouts to fire. The retransmissions also generate even more traffic on the pipe, which simply compounds the issue.

    Or, your app may be receiving throttle nacks, or no response at all, which, depending on your application logic, may be triggering a back-off wait and retry mechanism (for example, if your requests are generated as the results of messages in a queueing framework). This causes delays waiting for timeouts, and additional traffic to be generated.

    The cumulative effect of these delays can in fact cause the actual throughput to be far lower than if you implemented a throttle and submitted more slowly, at a steady rate that is within the capacity of the receiving system and of the pipe in between.

    So, the headline here is: use throttling to ensure you are not submitting faster than the throughput agreed with the provider of the API you are communicating with.
  • Low-latency responses: When you implement a service that is receiving callbacks for incoming messages and you are expecting to handle a large volume of requests in a short amount of time, it is vital that the receipt of these requests is acknowledged as quickly as possible.

    It is likely that upon receiving a message callback, your system will have a number of things to do. It might want to log the request in a database, or it might update some totals against an account. It may need to execute some complex business logic to perform some actions on one of your users’ accounts as a result of the contents of the message.

    Any of these activities may be a lengthy process and rely on external resources such as database servers, locks on database tables or making further web service calls to other services to perform further actions.

    This can all add up to an action that takes a very long time to execute, from hundreds of milliseconds to seconds. This can quickly reduce the potential throughput of receiving these callbacks to a very small number of requests per second.

    A common pitfall is to follow the anti-pattern where the callback request is received, which immediately executes a sequence of events similar to that described above. You can potentially wait on a shared lock, make remote web service calls, or wait for slow round-trip access to a database. Only when these actions are complete does the system then acknowledge receipt of the callback message. This sort of pattern does not scale beyond a very small volume of traffic.

    The service performing the callback requests to your application (in this case, the Nexmo messaging platform) will be operating its own throttling and flow control mechanisms. If you take a long time to acknowledge messages, Nexmo will send you more requests in parallel. But this parallelism is limited. Only a certain number of requests will be made until the acknowledgments of at least some of those requests have been received. Again, think TCP sliding windows.

    Thus, taking a particularly bad example where each request takes 5 seconds to acknowledge: if you have a large number of incoming messages, you will quickly receive 5 requests, then receive nothing until those 5 requests are acknowledged (effectively a 5-second stall in the traffic flow). Each time this window is fully used up, the traffic flow will stall again, so 5 requests every 5 seconds works out to roughly one message per second.

    In order to receive high volumes of messages rapidly, it is vital that these requests are acknowledged quickly. Common patterns and frameworks are widely available in order to detach the receipt and acknowledgement of a request from its execution. At a basic level, Java developers have the executor framework available to dispatch the request to a separate execution thread. Similar frameworks are available for all common languages.

    At an enterprise level, consider making use of a queueing framework to detach the handling of an incoming request from the acknowledgement. This will allow you to accept a large volume of requests quickly, and deal with them as fast as your application can.
  • Load-balance: As the volume of requests grows, it is inevitable that in time, the hardware requirements of your infrastructure will increase in order to serve all of the requests in a timely manner.

    Typically, this will involve installing a number of instances of your application sitting behind a load balancer. Incoming end user web requests are spread across your farm of servers using a variety of algorithms (maybe round-robin or random), weighted differently for different servers, with sticky or floating sessions depending on the individual needs of your application.

    The same is true for incoming web service requests. Your farm of servers sits behind the load balancer’s virtual address and each receives a subset of the requests. As the volume increases, you add more servers to handle more of the load.

    This is, of course, painting a simplified picture of the scenario. There will be other factors your application needs to consider to scale such as database capacity, lock contention or shared cache infrastructure. These are all way outside the scope of this article. The overriding principle of load balancing of the requests remains true though. 
  • Advantages of disconnected HTTP versus connected SMPP: SMPP is a highly scalable telecoms protocol, but the traditional approach of communicating over a persistently connected socket brings with it a number of challenges, including scaling across multiple servers. Avoiding those challenges is one of the key advantages of a REST-based approach.

    Persistent socket protocols, by their nature, are heavyweight beasts, and the allocation of available binds is usually a restricted resource. It would be rare to encounter a provider who will allow you to open a large number of sockets from a large number of servers and keep them open permanently. This means that as your number of servers grows, it may not be possible to have them all take part in receiving message requests. This can be a serious roadblock to scaling your infrastructure quickly to meet changing demands.

    Additionally, in such an integration your application binds a socket to the destination service and waits for requests to arrive. These are direct point-to-point connections that your application must establish. There is no automatic means of switching that connection between, say, a master and backup server in the event of failure, or of shaping the flow of traffic so that more requests are sent to your servers with more capacity than to the servers that are busy. This is completely out of your control.

With a REST implementation, there is no persistent socket to establish, no overhead in maintaining that socket, performing keep-alive cycles or managing the lifecycle of establishing and closing down the connection. Instead, you receive a stream of requests that you can shape and direct according to your individual requirements.

If you have a server that is twice as powerful as some of your older kit, you can set up a load-balancing policy to send more requests to that server. If you have a server you want to keep as a warm standby, you can do so, and only start sending it requests when circumstances dictate. The important distinction here is that you are in control, and are free to grow and scale your infrastructure in whatever ways are required.
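To make the weighting and warm-standby idea concrete, here is a minimal Python sketch of a weighted round-robin policy. In practice this logic normally lives in your load balancer's configuration rather than in application code. The `Balancer` class, the server names and the weights are all hypothetical, chosen only for illustration.

```python
import itertools


def weighted_round_robin(weights):
    """Return an endless iterator of server names, each in proportion to its weight."""
    # Expand {"new": 2, "old": 1} into ["new", "new", "old"] and cycle over it.
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(schedule)


class Balancer:
    """Hypothetical policy: weighted rotation plus servers held back as warm standby."""

    def __init__(self, active, standby=None):
        self._active = dict(active)            # server name -> weight, in rotation
        self._standby = dict(standby or {})    # server name -> weight, held back
        self._rotation = weighted_round_robin(self._active)

    def pick(self):
        """Choose the server that should receive the next request."""
        return next(self._rotation)

    def activate_standby(self):
        """Bring the warm standby servers into rotation when circumstances dictate."""
        self._active.update(self._standby)
        self._standby = {}
        self._rotation = weighted_round_robin(self._active)


lb = Balancer({"new": 2, "old": 1}, standby={"spare": 1})
first_six = [lb.pick() for _ in range(6)]   # "new" receives twice as many as "old"
```

The simple expansion used here sends consecutive requests to the same heavier server. Real load balancers typically interleave the picks more smoothly, but the proportions are the same.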
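On the sending side, the parallelism and throttle tips can be combined in a few lines. The following Python sketch is one possible shape and not a Nexmo client: `Throttle`, `dispatch` and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP/1.1 request made over a reused keep-alive connection. The rate and worker count are placeholders for whatever throughput you have agreed with your provider.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Paces callers to at most `rate` acquisitions per second, evenly spaced."""

    def __init__(self, rate):
        self._interval = 1.0 / rate
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other workers can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def fake_send(message):
    """Stand-in for a real HTTP call: simulates 20 ms of round-trip latency."""
    time.sleep(0.02)
    return "ok:" + message


def dispatch(messages, send, rate, workers):
    """Send every message via `send`, `workers` at a time, at most `rate` per second."""
    throttle = Throttle(rate)

    def task(message):
        throttle.acquire()       # wait for our slot before touching the network
        return send(message)     # latency here overlaps with the other workers

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, messages))   # results come back in input order


responses = dispatch(["hello", "world"], fake_send, rate=100, workers=4)
```

The worker pool absorbs the per-request latency, while the throttle keeps the overall submission rate steady regardless of how many workers are running.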
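On the receiving side, the low-latency tip comes down to separating acknowledgement from processing. This Python sketch assumes a hypothetical `CallbackReceiver` whose `handle` method would be called by whatever HTTP framework you use. It only queues the payload and returns a 200 status, while background worker threads run the slow business logic. The `process` function is a placeholder for your own code.

```python
import queue
import threading


class CallbackReceiver:
    """Acknowledges callbacks immediately and processes them on background workers."""

    def __init__(self, process, workers=2):
        self._process = process          # slow business logic (hypothetical)
        self._queue = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def handle(self, payload):
        # Called from the HTTP layer: enqueue and return straight away, so the
        # response never waits on databases, locks or other web services.
        self._queue.put(payload)
        return 200

    def _work(self):
        while True:
            payload = self._queue.get()
            try:
                self._process(payload)
            except Exception as exc:
                # A failed message must not kill the worker; real code would log or retry.
                print("processing failed:", exc)
            finally:
                self._queue.task_done()

    def drain(self):
        """Block until everything accepted so far has been processed."""
        self._queue.join()
```

An in-memory queue like this loses any unprocessed messages if the process dies after acknowledging them. That is one reason to move to a durable queueing framework as volumes and stakes grow.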
