Why do IoT Applications Have More Trouble Scaling than Traditional SaaS?
The Answer: The Geometric Effect of Network Scaling.
Geometric just means that the problems grow exponentially while the underlying input grows linearly. Device data streams feed multiple interconnected processes, so each new device multiplies work across the application as it grows.
Many developers working in IoT today come from the Web Application space, where user inputs and sessions define the scaling problem. They are often underprepared for the complexities introduced as devices scale.
Additionally, Enterprise IoT “deals” are often lumpy: a company might have to add thousands of devices to its application in a matter of weeks, which does not give the application time to “break slowly.”
Here are a couple of common areas of weakness:
Databases: IoT data comes streaming in, nonstop. As the customer base grows, so does this data flow. Usually there are processes that manipulate the data and then write it back into the same or another database. The inherent problem is that relational databases do not scale horizontally, and vertical scaling hits a ceiling, which means performance problems appear suddenly once these architectures reach their breaking point. The only way to get ahead of this is to use your load testing capability to constantly stay 2-4x ahead of your current volume (see below). Another option is to use a “serverless” database from a cloud platform and offload the scaling problem onto your service provider. Buyer beware: concurrency issues can be a nasty bug when using large cloud-based NoSQL databases for IoT applications.
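To illustrate the concurrency point, here is a minimal sketch of an optimistic, conditional write, assuming a hypothetical DynamoDB table called device_state accessed through boto3. The names are placeholders, but the idea (reject the write if someone else changed the item since you read it) carries over to most cloud NoSQL stores.

```python
# Sketch: guarding against lost updates when many writers touch the same item.
# "device_state" and its "version" attribute are assumptions for illustration.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("device_state")

def update_device_state(device_id: str, new_state: dict, expected_version: int) -> bool:
    """Write new_state only if nobody else updated the item since we read it."""
    try:
        table.update_item(
            Key={"device_id": device_id},
            UpdateExpression="SET #s = :state, version = :next",
            ConditionExpression="version = :expected",
            ExpressionAttributeNames={"#s": "state"},
            ExpressionAttributeValues={
                ":state": new_state,  # keep values to strings/ints/Decimals
                ":next": expected_version + 1,
                ":expected": expected_version,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else won the race; re-read and retry
        raise
```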
In-Line Processes: Most IoT applications have some sort of data processing scheme “in line,” where data comes in, gets transformed/processed, and then gets passed on to some other process or service. I have seen many teams write code that assumes every step of the process will work flawlessly. Then one day, when the system starts getting hammered, the database slows down, your nicely threaded processes slow down behind it, and eventually the service runs out of memory. Crash. Game over. These queue-based processes are great candidates for serverless functions (see below), so at least you don’t have to worry about vertical scaling during slowdowns.
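A cheap defense is to bound the in-line queue and decide explicitly what happens when downstream slows down, instead of letting memory grow until the crash. The sketch below is illustrative Python with made-up names, not any particular framework’s API.

```python
# Sketch of an in-line pipeline stage that does NOT assume the next step is healthy.
# A bounded queue plus an explicit "shed load" decision keeps a slow database from
# quietly eating all of your memory.
import queue
import logging

work_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)  # hard memory bound

def ingest(message: dict) -> None:
    """Called for every inbound device message."""
    try:
        work_queue.put(message, timeout=0.5)
    except queue.Full:
        # Downstream is slow: degrade loudly now instead of crashing later.
        logging.warning("pipeline saturated, dropping message %s", message.get("id"))

def worker(write_to_db) -> None:
    """Pulls messages and tolerates downstream failure instead of piling up work."""
    while True:
        msg = work_queue.get()
        try:
            write_to_db(msg)  # this may be the slow, failing step
        except Exception:
            logging.exception("write failed for %s; sending to dead-letter", msg.get("id"))
        finally:
            work_queue.task_done()
```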
What do we do??
Design Load Testing From Day One. New IoT applications need to be built with large and powerful data simulators that can “hammer the crap” out of the application in a way that closely mirrors how real devices work. If real data comes in via a Cisco VPN tunnel, then so should the load testing data. If you’re working on an application now that does not have a load testing capability, I highly recommend that you move it up the backlog.
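A simulator can be small and still hurt. The sketch below spins up thousands of fake devices publishing on a realistic cadence; the endpoint, payload shape, and device count are assumptions, and in practice the traffic should travel the same path real devices use (VPN tunnel and all).

```python
# Sketch of a device simulator that hammers the ingest path like real hardware would.
import asyncio
import random
import time
import aiohttp

INGEST_URL = "https://ingest.example.com/telemetry"  # hypothetical endpoint
DEVICE_COUNT = 5_000
REPORT_INTERVAL_S = 10.0

async def fake_device(session: aiohttp.ClientSession, device_id: int) -> None:
    while True:
        payload = {
            "device_id": f"sim-{device_id:06d}",
            "ts": time.time(),
            "temperature": round(random.uniform(15, 35), 2),
        }
        async with session.post(INGEST_URL, json=payload) as resp:
            resp.raise_for_status()
        # Jitter so thousands of devices don't all fire in the same millisecond.
        await asyncio.sleep(REPORT_INTERVAL_S * random.uniform(0.8, 1.2))

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fake_device(session, i) for i in range(DEVICE_COUNT)))

if __name__ == "__main__":
    asyncio.run(main())
```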
Consider Serverless Architectures. Serverless functions, databases, and message queues are great tools for IoT products, especially early on. They do eventually cost more to run than dedicated code on Linux machines, but it takes a long while to reach that point. I guarantee you will face fewer scaling pains with these architectures.
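As an example, a single in-line step moved to a serverless function might look like the sketch below, assuming an AWS Lambda triggered by an SQS queue and writing to a hypothetical DynamoDB table; swap in whatever your cloud provider calls these pieces.

```python
# Minimal sketch of one pipeline step as a serverless function (SQS -> Lambda -> DynamoDB).
# Table name and payload fields are placeholders; the point is that horizontal
# scaling becomes the platform's problem, not yours.
import json
import boto3

table = boto3.resource("dynamodb").Table("telemetry")  # hypothetical table

def handler(event, context):
    # SQS delivers a batch of messages in event["Records"].
    for record in event["Records"]:
        reading = json.loads(record["body"])
        table.put_item(Item={
            "device_id": reading["device_id"],
            "ts": int(reading["ts"]),
            "payload": json.dumps(reading),
        })
    return {"processed": len(event["Records"])}
```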
Stateless Load Balancing. If you do run code in containers or on machines in your application, you MUST include load balancing from moment one. I love to include “kill switches” in my applications that can be controlled from the load testing application, so we can prove that any process/machine/container can die at any time without affecting us. Make sure you never write an I/O process such that the machine that processed an input must be the one to process the output (examples: WebSocket connections, message routing, etc.), or your load balancing/failover will not work flawlessly.
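Here is a minimal sketch of such a kill switch, assuming a Flask service with a made-up /admin/kill endpoint guarded by a token. The details are placeholders, and something like this belongs only in test environments behind proper auth.

```python
# Sketch of a "kill switch" the load test can hit to prove any instance may die
# at any time without affecting the system.
import os
from flask import Flask, request, abort

app = Flask(__name__)
KILL_TOKEN = os.environ.get("KILL_SWITCH_TOKEN", "")

@app.route("/admin/kill", methods=["POST"])
def kill():
    if not KILL_TOKEN or request.headers.get("X-Kill-Token") != KILL_TOKEN:
        abort(403)
    # Hard-exit so the load balancer/orchestrator has to absorb the loss.
    os._exit(1)

@app.route("/healthz")
def healthz():
    return "ok", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```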
Canary. As silly as it sounds, I always write a standalone service that publishes and consumes data in my application end-to-end on a frequent basis, usually from multiple availability zones. When the system starts having problems, this “canary” is usually the first thing to find the problem. Unless you have great….
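A canary can be as small as the sketch below: publish a synthetic reading, wait for it to appear on the other side, and page if it doesn’t. publish_reading, read_back, and page_oncall are placeholders for your real ingest, query, and alerting paths.

```python
# Sketch of an end-to-end canary: publish a marked reading, verify it arrives,
# alert if it never shows up.
import time
import uuid
import logging

PUBLISH_PERIOD_S = 60
MAX_END_TO_END_S = 30

def run_canary(publish_reading, read_back, page_oncall) -> None:
    while True:
        marker = str(uuid.uuid4())
        sent_at = time.time()
        publish_reading({"device_id": "canary-01", "marker": marker, "ts": sent_at})

        deadline = sent_at + MAX_END_TO_END_S
        while time.time() < deadline:
            if read_back(marker):  # did it land in the serving store?
                logging.info("canary ok in %.1fs", time.time() - sent_at)
                break
            time.sleep(1)
        else:
            page_oncall(f"canary {marker} never arrived after {MAX_END_TO_END_S}s")

        time.sleep(PUBLISH_PERIOD_S)
```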
Instrumentation. Any downtime or outage you couldn’t see before it happened is a failure in your ability to properly instrument your system. Aggregating logs isn’t enough. You should get alerts well in advance of any problem that affects your system.
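One concrete pattern is alerting on leading indicators rather than on the outage itself. The sketch below watches queue utilization and pages while there is still headroom; get_queue_depth and page_oncall are made-up hooks, and the 50% threshold is an assumption to tune for your own system.

```python
# Sketch of "alert before it hurts": page on a leading indicator (queue depth as
# a fraction of capacity) instead of waiting for the crash.
import time
import logging

QUEUE_CAPACITY = 10_000
WARN_AT = 0.50      # page well before saturation, not at 100%
CHECK_PERIOD_S = 30

def watch_queue(get_queue_depth, page_oncall) -> None:
    while True:
        depth = get_queue_depth()
        utilization = depth / QUEUE_CAPACITY
        logging.info("queue depth %d (%.0f%% of capacity)", depth, utilization * 100)
        if utilization >= WARN_AT:
            page_oncall(f"ingest queue at {utilization:.0%} of capacity; scale out now")
        time.sleep(CHECK_PERIOD_S)
```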
Solve problems of scale before they kill your business.