This column is by Diogenes Ianakiara, Co-Founder/CEO at Getty/IO Inc.
Over the last eight years, I have watched several companies around the world absorb million-dollar losses because of systems that did not scale and bugs that nobody expected.
In moments of peak demand, these systems are forced to scale massively, many of them across clusters of dozens or hundreds of servers.
This is a common scenario in banks, insurers, public companies and multinationals with high transaction volumes and many online users.
During these crises, some companies set up the famous “War Room”, a big room where everyone responsible for the system is summoned. There you will find directors, managers, system architects, developers, infrastructure and business analysts, and even performance and troubleshooting analysts from specialized consultancies.
Normally, nobody asks why it happened, and when someone does, usually no action is taken. Everybody just needs to solve the problem as soon as possible. Once it is solved, and it usually is, the cycle begins again.
“We are n days without a crisis” becomes a running joke.
If your company constantly suffers from crises like the ones described in this article, it is a sign that things are not going well and that the systems being produced follow little or no quality standard.
A software architecture should take into account exponential growth, performance, scalability, testability and interoperability.
Software created ten years ago is a legacy that did not anticipate the growing use of mobile devices. As time goes by, these systems are forced to handle dozens or even hundreds of integrations to keep up with constant business demands. What was once a simple service is now a critical system that was never designed for the task.
Companies depend increasingly on technology to differentiate themselves from competitors. Buggy software can cause damage that nobody accounted for.
Some companies have a test team of more than 100 people. Although expensive, this “swollen” testing model has delivered good results for a long time.
Validating a system with 100 testers can be a way to find bugs or regressions; however, as time goes by, the cost of the software can become very high.
Software development and quality processes are perceived in different ways. Some companies believe they have them, some say they have used them and others say they use them now, but the hardest part is knowing whether what has been applied is producing a positive result.
I have witnessed isolated cases where the disorganization was so severe that even deployments to the production environment were done by a developer from his own workstation.
According to Wikipedia, software QA is an area of software engineering that guarantees the quality of software through the definition and standardization of the development process.
The Contract (QA)
At Getty/IO Inc we use Cucumber with BDD for end-to-end tests, TDD for unit tests, code linting to make sure that everybody follows the same coding standard, and code review to teach the best programming practices. All of this runs together with a Continuous Integration (CI) and Continuous Delivery (CD) system.
With a team of 1–4 developers, it is possible to create end-to-end tests that can run on demand as many times as necessary.
Functional, automated tests (Cucumber + BDD) are a requirement for every build, since they give us better control and effectiveness as the code is promoted to each environment.
To speed up this process, we need to identify the 20% of the functionalities that represent 80% of the business. For this, we use the “Pareto principle”. That way we can get results faster and gradually increase the test coverage of the functionalities.
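As a rough illustration of how that 20% could be picked, the sketch below ranks features by their share of business volume and keeps the top ones until about 80% is covered. The feature names and numbers are invented for the example, and `pareto_features` is a hypothetical helper, not part of any tool mentioned in this article.

```python
from typing import Dict, List


def pareto_features(usage: Dict[str, float], target_share: float = 0.8) -> List[str]:
    """Return the most-used features that together cover target_share of total usage."""
    total = sum(usage.values())
    if total <= 0:
        return []
    selected: List[str] = []
    covered = 0.0
    # Walk features from most to least used until the target share is covered.
    for name, weight in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        if covered / total >= target_share:
            break
        selected.append(name)
        covered += weight
    return selected


# Hypothetical share of business transactions per feature (assumption).
usage = {"checkout": 50, "login": 20, "search": 12, "profile": 8, "reports": 6, "export": 4}
print(pareto_features(usage))  # -> ['checkout', 'login', 'search']
```

In practice the weights would come from your own analytics or transaction logs; the point is only that the first end-to-end tests go to the features at the top of this list.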
PROTIP: Bugs are a priority. The responsible developer should create a fix branch and an automated test for each bug. That way, we can guarantee that it does not show up again.
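A minimal sketch of what "one automated test per bug" can look like. The bug is hypothetical: we assume a discount above 100% once produced a negative total. The tests are written as plain functions with `assert`, the style a runner such as pytest would pick up; `apply_discount` is an invented function used only for this example.

```python
def apply_discount(total: float, percent: float) -> float:
    """Apply a percentage discount to a total."""
    # The fix for the hypothetical bug: clamp the percentage into [0, 100]
    # so the resulting total can never become negative.
    percent = max(0.0, min(100.0, percent))
    return round(total * (1 - percent / 100), 2)


def test_discount_above_100_percent_never_goes_negative():
    # Regression test: reproduces the exact input that triggered the bug.
    assert apply_discount(50.0, 150.0) == 0.0


def test_regular_discount_still_works():
    # Guards the normal behaviour so the fix does not break it.
    assert apply_discount(200.0, 25.0) == 150.0
```

The regression test is committed on the same fix branch as the correction, so the bug cannot silently come back in a later release.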
We use the same procedure to increase code quality and avoid regressions, which happen when a feature that used to work starts showing errors or when a problem recurs after it was fixed.
The tests run locally during development and remotely for each pull request.
Another developer on the same team should evaluate the pull request only after it has passed all health checks.
PROTIP: The contract is simple: no code should be merged without passing all the tests. New or updated code cannot be accepted without unit tests.
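The contract can be expressed as a small gate, sketched below. This is not how any particular CI product works; `PullRequest`, `merge_blockers` and the `test_*.py` naming convention are assumptions made for the illustration.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PullRequest:
    checks: Dict[str, bool]   # check name -> did it pass?
    changed_files: List[str]  # paths touched by the pull request
    approvals: int            # number of approvals from other developers


def is_test_file(path: str) -> bool:
    # Assumption: tests follow the common "test_*.py" naming convention.
    return os.path.basename(path).startswith("test_")


def merge_blockers(pr: PullRequest) -> List[str]:
    """Return the reasons this pull request must not be merged (empty list = OK)."""
    blockers: List[str] = []
    failed = sorted(name for name, passed in pr.checks.items() if not passed)
    if failed:
        blockers.append("failing checks: " + ", ".join(failed))
    code_changed = any(f.endswith(".py") and not is_test_file(f) for f in pr.changed_files)
    tests_changed = any(is_test_file(f) for f in pr.changed_files)
    if code_changed and not tests_changed:
        blockers.append("code changed without unit tests")
    if pr.approvals < 1:
        blockers.append("no approval from another developer")
    return blockers


pr = PullRequest(
    checks={"lint": True, "unit": True, "e2e": False},
    changed_files=["billing/invoice.py"],
    approvals=0,
)
print(merge_blockers(pr))
```

Most hosted Git platforms offer branch protection rules that enforce the same idea; the sketch only makes the rules of the contract explicit.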
As a result, your team will care about gradually increasing test coverage, and you will notice that the systems have far fewer problems.
Your team will mature faster, the software's quality will increase, your headaches will decrease, your development area will be more effective, and your users will thank you.
The Release (Continuous Delivery)
In 2012, Knight's Electronic Trading Group (ETG) was trading about 21 billion dollars per day. With roughly $400 million in assets, the company was effectively bankrupted in 45 minutes by a failed deployment (you can read about it here).
In 45 minutes, Knight went from being the largest trader in US equities and a major market maker in the NYSE and NASDAQ to bankrupt.
The deployment of a release to the development and staging environments should be automatic, but only if the tests ran successfully.
At this point, test coverage cannot decrease. If it does, it means that the developers are not testing or that they are blocked.
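The two rules above (tests must pass, coverage must not drop) fit in a single check. A minimal sketch, assuming the coverage percentages of the previous and current builds are available from your CI system; `should_auto_deploy` is a hypothetical name.

```python
def should_auto_deploy(tests_passed: bool, previous_coverage: float, current_coverage: float) -> bool:
    """Allow automatic deployment to development/staging only when
    the tests passed and coverage did not decrease."""
    return tests_passed and current_coverage >= previous_coverage


print(should_auto_deploy(True, 71.5, 72.0))   # -> True
print(should_auto_deploy(True, 71.5, 70.9))   # -> False: coverage dropped
print(should_auto_deploy(False, 71.5, 80.0))  # -> False: tests failed
```

A blocked deployment caused by falling coverage is a useful signal in itself: it is worth asking the team whether they are skipping tests or are stuck on something.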
The developers are responsible for verifying functionality in the development environment and communicating that a new version is ready to be tested.
Together with the users, the Product Manager is responsible for verifying the system in a staging environment using a checklist.
The validation performed by the Product Manager should not replace the functional and automated tests.
PROTIP: You can use the Pareto principle to choose what should be part of the validation checklist.
Once the system is validated on staging, the Product Manager should schedule a new deployment to production.
As a result, the system passes through all the tests before being deployed to the desired environment. For mobile apps, it is also possible to automate delivery to the Apple App Store and Google Play.
Application Performance Management
It is common to find systems that do not scale under high demand. In 70% of the cases we work on, the problem lies in the environment configuration. This kind of problem is usually solved quickly and can be avoided by running continuous load testing for each system release.
PROTIP: While this kind of problem can be fixed easily, it can also cause damage that nobody accounted for.
Examples of errors that can be fixed quickly:
- Thread Pool Size
- Connection Pool Size
- Open Files Limit (OS)
- Disk Sizing
- Load Balancer Persistence
- Connection Time Out
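Several of these settings can be checked before a release rather than discovered during a crisis. As one example, the sketch below reads the open files limit of the current process with Python's standard `resource` module (available on Unix-like systems only) and compares it with the number of connections you expect. The `reserve` of 256 descriptors for logs, libraries and other files is an arbitrary assumption, and both function names are hypothetical.

```python
import resource  # standard library, Unix-like systems only


def current_open_files_limit() -> int:
    """Soft limit on open file descriptors for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def has_fd_headroom(soft_limit: int, expected_connections: int, reserve: int = 256) -> bool:
    """True if the limit leaves room for the expected connections plus a reserve."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return soft_limit >= expected_connections + reserve


limit = current_open_files_limit()
print("open files limit:", limit)
print("enough for 5000 connections:", has_fd_headroom(limit, 5000))
```

The same pattern (read the real value, compare it with the expected load, fail the build if it is too small) applies to thread pools, connection pools and timeouts, using whatever configuration source your stack exposes.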
The remaining 30% of problems are the most complex to solve. The cause may be the code, a third-party library, the infrastructure, etc.
Some of them are:
- Memory Leak
- Thread Concurrency
- Garbage Collection
- Cluster Sizing
- HA (High Availability)
- Shared Memory
- Cache Limits
To avoid problems like these, we recommend planning and running stress tests. These tests should identify the application's limits and show which part of the system becomes the bottleneck at a given number X of simultaneous users.
PROTIP: It is possible to run automated performance tests using a continuous integration system.
A misunderstood or poorly designed stress test can produce unrealistic data, since the number of users needed to bring down the system could be smaller than expected.
Using APM tools, you can create interactive dashboards that show the health of your system in real time during each test session.
PROTIP: A stress test has been executed successfully when part of the system or the entire stack crashes somehow. If the system has not crashed, something is wrong with the test design.
The execution of a stress test should reveal the limits of your system: the number of users at which service quality broke down.
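The idea of ramping up load until quality breaks can be sketched as follows. The `measure` callback stands in for whatever load tool you use; here it is replaced by `fake_service`, a made-up capacity model, so the example runs on its own. The thresholds (500 ms, 1% errors) are assumptions, not recommendations.

```python
from typing import Callable, Optional, Tuple

Sample = Tuple[float, float]  # (p95 latency in ms, error rate between 0 and 1)


def find_breaking_point(
    measure: Callable[[int], Sample],
    start: int,
    step: int,
    max_users: int,
    max_latency_ms: float,
    max_error_rate: float,
) -> Optional[int]:
    """Increase simultaneous users step by step and return the first
    user count at which latency or error rate exceeds the limits."""
    users = start
    while users <= max_users:
        latency, errors = measure(users)
        if latency > max_latency_ms or errors > max_error_rate:
            return users
        users += step
    return None  # the system never broke: the test did not reach its limits


def fake_service(users: int) -> Sample:
    # Made-up model: latency is flat up to 400 users and then grows steeply;
    # errors appear above 600 users.
    latency = 120.0 if users <= 400 else 120.0 + (users - 400) * 5.0
    errors = 0.0 if users <= 600 else 0.2
    return latency, errors


print(find_breaking_point(fake_service, 100, 100, 1000, 500.0, 0.01))  # -> 500
```

A `None` result matches the PROTIP above: if nothing broke, raise the load or review the test design, because the limits of the system are still unknown.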
For each test performed, evidence should be submitted and analyzed by a specialist. Suggestions for optimization and scalability should be presented if the target performance was not reached.
PROTIP: For each change to the environment or application, a new test session is recommended until the goal is reached.
Although for some companies this process is manual and expensive, a stress test can easily be adapted to your company's continuous integration (CI) system, drastically reducing its running costs. As a result, you can run the stress test on demand for each modification of your system without the need to hire specialized performance consultants.
In the coming years, the systems that today are only surviving will become extinct, because the IoT and VR markets will demand more and more infrastructure and scalability.
PROTIP: The use of a continuous integration and delivery (CI/CD) system is crucial to automate the quality process.
Tests should be written daily, always applying techniques such as unit, integration and end-to-end tests.
Eventually, your team will need training to apply the best methods and practices.
Release time is a risk factor: while new companies spend only a few hours publishing a new service, others need months to turn the same service into reality. Their development costs are correspondingly higher.
PROTIP: Big companies need bureaucratic processes; however, innovation cannot be handled that way.
The development process should be viewed strategically. Smart and profitable companies like Uber, Facebook and Netflix invest increasingly in innovation and software quality.
To get into this game, today's companies must adapt and offer ever more effective systems to their consumers. They will need to become the companies of the future and set a high quality standard for the software development process.
Achieving excellence in software development can be a simple or a lengthy process; it depends on how your team accepts the challenge. The team should be willing to achieve the goals, follow new paths and always strive to get better.
Together, we imagine and create web and mobile applications that scale automatically and are always available through any device anywhere.
We reinvent and review complex software architecture using the same technology and processes that serve millions of users around the world.
If you have problems like the ones described here, send a message to firstname.lastname@example.org. We can help!
This is a curated post. The statements, opinions and data contained in these publications are solely those of the individual authors and contributors and not of iamwire or its editor(s). This article was originally published by the author here.