A Stochastic Process-Based Look at Software Reliability

All, I wanted to introduce you to another one of my Software Sciences & Analytics colleagues, Vignesh TS.  Vignesh works out of Bangalore and had some really interesting thoughts on software reliability that I thought would be great to share.  Enjoy!

My interest in stochastic-process-based modeling of real-life phenomena led me, a couple of years ago, to an interesting application: software reliability.

As a researcher in statistics who enjoys programming (especially in Python), the first question I asked was whether it is even appropriate to study faults in software in a statistical sense. Honestly, there is nothing “random” about a bug in software! Then why were research papers on software reliability being published in premier statistics journals? It turns out that while there is nothing random about a bug itself, its discovery and subsequent correction can be modeled as a stochastic process.

Okay, so there seems to be some justification for studying software reliability in a statistical sense, but then why not use the huge body of literature that already exists for engineering reliability? A hidden bug in released software seems, at first glance, analogous to an undetected crack in the blades of an operating gas turbine. Well, there is a crucial difference between gas turbines and released software: no two gas turbines can be exactly the same, but two copies of released software are exactly the same! Obvious as this remark may seem, it has profound implications for why software reliability models should differ from models for engineering reliability.

Is this the only difference between faults in a gas turbine and faults in software? No. Gas turbines can fail because they “age”; there is no comparable concept of aging in software. Well, there is a notion of software aging, applicable to embedded software that runs non-stop without rebooting: a small memory leak in such software can eat up the device RAM over time and cause a segfault (we shall, however, ignore this for the moment). In reliability lingo, gas turbines have a “bathtub”-shaped hazard function, while software has a monotonically decreasing hazard function. And by the way, I’d be remiss if I didn’t plug the blog entries my colleagues have written on the work they are doing to help prevent and detect aging and failure in gas turbines. You should check out “How your cell phone camera can improve the performance of a turbine engine”. But back to the topic at hand…
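To make that contrast concrete, here is a minimal sketch comparing the two hazard shapes using Weibull hazard functions. The shape and scale parameters and the constant base rate are made-up illustration values, not fitted to any real turbine or software data:

```python
import math

def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard (instantaneous failure rate): h(t) = (k/s) * (t/s)^(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def software_hazard(t):
    # Debugging keeps removing faults, so the hazard only decreases (shape < 1).
    return weibull_hazard(t, shape=0.5)

def bathtub_hazard(t):
    # Hardware: infant mortality (shape < 1) + wear-out (shape > 1) + constant base rate.
    return weibull_hazard(t, shape=0.5) + weibull_hazard(t, shape=3.0, scale=10.0) + 0.1

for t in (0.1, 1.0, 50.0):
    print(f"t={t:5}: software={software_hazard(t):.3f}  hardware={bathtub_hazard(t):.3f}")
```

The software hazard falls monotonically, while the hardware hazard first falls (infant mortality dies out) and then rises again as wear-out takes over, tracing the bathtub shape.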

Software reliability is still an active field of research, with a literature full of models and their criticisms. However, it is generally recognized that a correct model for software reliability could have the following uses.

  1. Helping a software development team decide when to stop testing and debugging a piece of software and release it. This is called target reliability, and it could guarantee a customer a certain minimum period of fault-free running of the software.
  2. Letting a software company know whether the reliability of its software is increasing with each new release.
  3. Correlating a software reliability metric with features of the development environment. For example, one question worth answering is whether the use of certain open-source C++ libraries, as opposed to a competing in-house library, increases the reliability of the software. Such questions and their answers could lead to formulating software development best practices.

Let’s get back to what I wanted to blog about: stochastic-process-based models for software reliability. The first stochastic process model for software reliability was proposed by Jelinski and Moranda in the 1970s. They modeled the process of discovering bugs and correcting them as a “pure death process”.

Imagine an unknown number of individuals in a closed room, a number that cannot increase. As macabre as it may sound, assume that the only way we come to know of the existence of an individual in the room is when that individual dies. Initially, while there are many individuals, deaths will be frequent; as time goes on, the frequency of deaths will drop. The rate at which the frequency drops can be used to estimate the number of individuals initially present in the room. Software bugs are like the individuals in the room, and their discovery and correction is like the death of an individual. From the frequency with which bugs are discovered over the course of debugging, we can estimate the number of bugs remaining when debugging stopped.
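As a sketch of how such an estimate might work, here is a minimal simulation of the Jelinski-Moranda pure death process. All parameter values (100 initial bugs, a per-bug detection rate of 0.05, stopping after 60 discoveries) are made up for illustration. With some number of bugs remaining, the waiting time to the next discovery is exponential with rate proportional to that number; the initial bug count is then recovered by maximizing the profile likelihood over candidate counts with a grid search:

```python
import math
import random

random.seed(42)

N0, phi = 100, 0.05   # true (hidden) initial bug count and per-bug detection rate

# Simulate the pure death process: with `remaining` bugs left, the waiting
# time until the next discovery is exponential with rate phi * remaining.
gaps = []
remaining = N0
while remaining > 0:
    gaps.append(random.expovariate(phi * remaining))
    remaining -= 1

observed = gaps[:60]  # pretend debugging stopped after 60 discoveries
m = len(observed)

def profile_loglik(N):
    """Jelinski-Moranda log-likelihood for initial bug count N (N >= m),
    with the rate phi replaced by its maximum-likelihood value given N."""
    s = sum((N - i) * t for i, t in enumerate(observed))
    phi_hat = m / s
    return sum(math.log(phi_hat * (N - i)) for i in range(m)) - phi_hat * s

# Grid search over candidate initial bug counts (the true N0 is in range).
best_N = max(range(m, 4 * m), key=profile_loglik)
print("discoveries observed:", m, "| estimated initial bug count:", best_N)
```

The estimate from a single simulated run is noisy, but it illustrates the idea: the rate at which discoveries slow down carries information about how many bugs were there to begin with.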

The Jelinski and Moranda model is clearly not realistic, but it was the first to explore the possibility of using stochastic processes for software reliability. Since then, a number of models have been proposed with increasing complexity and realism.

Models like the Jelinski and Moranda model belong to the class of stochastic processes known as self-exciting Poisson processes. They are clearly not applicable only to software reliability; with the right tweaks, they have other applications too.

  1. In ecology, to estimate the biodiversity of an ecosystem.
  2. In astronomy, for example to estimate the number of red giants in a galaxy.
  3. In public health, to estimate the number of people in a given year and region who have a non-communicable disease such as cancer.

I have only touched the tip of the iceberg in talking about software reliability. It is a field where sophisticated probabilistic models can help estimate accurate software reliability metrics. If you are interested in reading more, a search for software reliability through journals such as IEEE Transactions on Reliability, Technometrics, and Biometrika should yield excellent papers on the current state of research in this field.


1 Comment

  1. Newt

    I’m not clear on how software bugs can be modeled stochastically:

    ‘fixing’ one bug can add unconstrained new bugs.

    because software (in von Neumann machines) can rewrite itself, innumerable new bugs can be introduced by a single hidden bug.

    Really, I’m not clear on what a bug is: a bug can be viewed as code that does not meet a design specification, or a design that does not meet the end-user requirements, or even simply a thread that does not behave as the programmer intended (since the programmer can herself be wrong, that may or may not be a bug from any other view).

    Is it a bug when a program counter overflows and the thread of execution begins again at 0? Not from the hardware design point of view (it’s a feature!). Since that ‘feature’ could be exploited in software, one has to get to the intentional level – did the programmer (or architect, if the programming comes from some kind of automatic system) intend for the code to do that? And since programming a system can be a large team enterprise, that intention has to be shared by the team (and we all know how rarely intentions are shared).