Friday, March 23, 2012


I think scalability is where true engineering takes place. Every vendor publishes scalability documentation stating how much load their product can support. These numbers are usually conservative so that the vendor doesn't promise too much. They need to be examined very carefully, since they will almost never give you ALL the information you need; they are just one piece of the puzzle.

Imagine that you are deploying a system for around 10,000 users. Of these, perhaps 2,000 are concurrent users performing various tasks. The system will already contain some data, and new data will be coming in hourly. Then you look at the vendor's scalability documentation, and it describes a 2,000-user system with 100 concurrent users performing one or two common tasks, against a data volume much lower than yours. Simply multiplying the environment in the scalability documentation will not give you the performance you are looking for.
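A quick back-of-envelope calculation shows why linear extrapolation of the vendor's benchmark misleads here. This sketch uses only the numbers from the scenario above; the point is that the concurrent load, which is what the hardware actually feels, grows much faster than the headline user count:

```python
# Numbers from the scenario above.
vendor_users, vendor_concurrent = 2_000, 100
target_users, target_concurrent = 10_000, 2_000

# Naive sizing: scale the benchmark by total user count.
user_multiplier = target_users / vendor_users            # 5x

# But the concurrency profiles differ sharply.
vendor_ratio = vendor_concurrent / vendor_users          # 5% of users active
target_ratio = target_concurrent / target_users          # 20% of users active

# The load the hardware actually sees grows by the concurrent count.
load_multiplier = target_concurrent / vendor_concurrent  # 20x

print(f"User-count multiplier: {user_multiplier:.0f}x")
print(f"Concurrent-load multiplier: {load_multiplier:.0f}x")
```

Sizing by the 5x user count alone would undershoot the real concurrent load by a factor of four, before even considering the heavier tasks and larger data volume.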

This is mainly because, depending on the user actions, the bottleneck may occur in different areas of the system. This is where knowledge of back-end technologies plays a crucial role. You will need to speed up the bottleneck areas, and the only way to figure out how is through theory and experience.

For example, let’s say that during operating hours a back-end database like SQL Server is being hit hard and is causing a bottleneck. Instead of the common shotgun approach of upgrading hardware and moving to faster storage, whip out Performance Monitor and start logging metrics. Windows Performance Monitor is a hugely underrated tool that can provide valuable insight into many system performance issues. In this scenario, however, determining which hardware component is causing the bottleneck would not be enough. You would also need to monitor specific SQL Server performance metrics to determine why that component is the bottleneck, and then address the issue.
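Once a collector set has been logging for a while, the data is easier to analyze outside the PerfMon UI. Below is a minimal sketch that summarizes counters from a PerfMon log exported to CSV (for example via relog.exe -f csv). The file name and the counter names are placeholders; substitute the counters from your own collector set:

```python
import csv
from statistics import mean

# Placeholders -- match these to your own exported log and counters.
LOG = "sql_baseline.csv"
COUNTERS = [
    r"\PhysicalDisk(_Total)\Avg. Disk sec/Read",
    r"\Processor(_Total)\% Processor Time",
]

def summarize(path, counters):
    """Return {counter: (average, maximum)} for each requested counter."""
    samples = {c: [] for c in counters}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for c in counters:
                # PerfMon CSV columns are prefixed with \\MACHINENAME,
                # so match each requested counter by suffix.
                for col, val in row.items():
                    if col is not None and col.endswith(c) and val not in ("", " "):
                        samples[c].append(float(val))
    return {c: (mean(v), max(v)) for c, v in samples.items() if v}
```

Averages hide spikes, so the maximum is reported alongside; a disk with a fine average read latency but ugly peaks during business hours is still a bottleneck.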

Going on with the example, you can run PerfMon and determine that the hard drives are causing the bottleneck; that's pretty common, since storage is usually the slowest part of the system. Drilling deeper, you can determine that the issue is due to constant reads from disk, at which point you can dig even deeper and check the cache hit ratio. If it is low, improving the cache hit ratio can increase SQL Server's performance without necessarily upgrading hardware. You may also notice that some storage is being heavily read while another storage array is underutilized. In that case, moving the database, or a few heavily read tables, to that array would also enhance performance.
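As a concrete sketch of the cache check: SQL Server exposes the buffer cache hit ratio as two raw counters that must be divided, the ratio counter and its base. PerfMon does the division for you, but if you pull the values yourself from the sys.dm_os_performance_counters DMV, you do the math. The T-SQL below is standard; connection handling is omitted and the example counter values are made up:

```python
# Standard query against SQL Server's performance counter DMV.
QUERY = """
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Buffer cache hit ratio',
                       'Buffer cache hit ratio base');
"""

def cache_hit_ratio(hit_value: int, base_value: int) -> float:
    """Divide the ratio counter by its base counter; return a percentage."""
    if base_value == 0:
        return 0.0
    return 100.0 * hit_value / base_value

# Example with made-up counter values:
# cache_hit_ratio(9_500, 10_000)  ->  95.0
```

A ratio that stays low under a typical OLTP workload suggests the buffer pool is starved for memory, which is exactly the "constant reads from disk" symptom above.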

Remember, various storage architectures are better suited for reading or for writing. It's not uncommon for the same database server to use different arrays for different databases, or even for different tables within a database. The impact of other components should be considered as well.

Furthermore, we can seek optimization outside of the bottleneck component. For example, let's say the front-end web servers are being heavily taxed. If the sessions are encrypted, the cause could be the large overhead of SSL and authentication. In this scenario, you can look into offloading SSL termination to a different front-end component, such as an application delivery controller.

Don't underestimate simple tools for purposes of scalability. Performance Monitor and Microsoft Excel with its analysis add-in are all you need in most cases.

Creatively scaling solutions is a dying talent, because most of the time people think it's easier to just throw more hardware at the problem.