Knowledge Based Evaluation of Software Systems: a Case Study

Solving software evaluation problems is a particularly difficult software engineering process, and many contradictory criteria must be considered to reach a decision. The way decision support techniques are currently applied suffers from a number of severe problems, such as naive interpretation of sophisticated methods and generation of counter-intuitive, and therefore most probably erroneous, results. In this paper we identify some common flaws in decision support for software evaluations. Subsequently, we discuss an integrated solution through which significant improvement may be achieved, based on the Multiple Criteria Decision Aid methodology and the exploitation of packaged software evaluation expertise in the form of an intelligent system. Both the common mistakes and the way they are overcome are explained through a real world example.


Introduction
The demand for high-quality and reliable software that complies with international standards and is easy to integrate into existing system structures is increasing continuously. At the same time, the cost of software production and maintenance is rising dramatically as a consequence of the increasing complexity of software systems and the need for better designed and user-friendly programs. Therefore, the evaluation of such software aspects is of paramount importance. We will use the term software evaluation throughout this paper to denote evaluation of various aspects of software systems.
Probably the most typical problem in software evaluation is the selection of one among many software products for the accomplishment of a specific task. However, many other problems may arise, such as the decision whether to develop a new product or acquire an existing commercial one with similar requirements. Moreover, software evaluation may be carried out from different points of view and may concern various parts of the software itself, its production process and its maintenance.
Consequently, software evaluation is not a simple technical activity, aiming to define an "objectively good software product", but a decision process where subjectivity and uncertainty are present without any possibility of arbitrary reduction.
In recent years, research has focused on specific software characteristics, such as models and methods for the evaluation of the quality of software products and of the software production process [2,6,7,9,11,12,23]. The need for systematic software evaluation throughout the software life cycle has been recognised and well defined procedures have already been proposed [22], while there are ongoing standardisation activities in this field [7].
However, although the research community has sought and proposed solutions to specific software evaluation problems, the situation in this field, from the point of view of decision support, is far from satisfactory. Various approaches have been proposed for the implementation of decision support for specific software evaluation problems. Typically, the evaluator is advised to define a number of attributes, suitable for the evaluation problem at hand. In certain cases [7,14], a set of attributes is already proposed. The evaluator, after defining the alternative choices, has to assign values and, sometimes, relative weights to the attributes and ultimately combine (aggregate) them to produce the final result. The problem is that, in practice, this is frequently accomplished in a wrong way, as will be exemplified in this paper.
The authors have participated extensively in software system evaluations, either providing decision aid to the evaluators or as evaluators themselves. After reviewing the relevant literature, it seems that the most commonly used method is the Weighted Average Sum (WAS), known also as Linear Weighted Attribute (LWA), or similar techniques, such as Weighted Sum (WS) (see [11]).
Another frequently used method is the Analytic Hierarchy Process (AHP, see [17,23]), where priorities are derived from the principal eigenvector of the pairwise comparison matrix of a set of elements, when the comparisons are expressed on ratio scales. WAS and AHP fall into a class of techniques known under the name Multiple-Criteria Decision Aid (MCDA, [16,20]). This methodology is applied to evaluation problems where the final decision depends on many, often contradictory, criteria. This paper identifies some common flaws in decision support for software evaluation. Our approach consists in exploiting packaged evaluation expertise in the form of an intelligent system, in order to avoid common mistakes and to speed up the evaluation process. We have developed ESSE, an Expert System for Software Evaluation, which supports various methods of the MCDA methodology, and we present its use in a real example, the evaluation of five different proposals for the information system modernisation in a large transport company.
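To make the AHP priority derivation mentioned above more concrete, the following minimal sketch approximates the principal eigenvector of a pairwise comparison matrix by power iteration. The function name `ahp_priorities` and the 3x3 matrix are hypothetical illustrations and are not taken from the case study; the matrix is deliberately consistent, so the priorities are proportional to 4 : 2 : 1.

```python
def ahp_priorities(matrix, iterations=100):
    """Approximate the principal eigenvector of a pairwise comparison
    matrix by power iteration, normalised so the priorities sum to 1."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply the matrix by the current priority vector.
        w_next = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w_next)
        w = [v / total for v in w_next]
    return w


# Hypothetical ratio-scale judgements: entry [i][j] states how many times
# element i is preferred to element j (a reciprocal matrix).
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
priorities = ahp_priorities(comparisons)
```

The sketch only makes sense when the entries really are ratio judgements; feeding it numbers that merely encode an ordinal ranking would still return a vector, but one without a meaningful interpretation.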
The rest of the paper is organised as follows: Section 2 introduces decision support for software evaluation and discusses some common errors during the implementation of a decision process for software evaluation. Section 3 highlights the manifestation of these errors in a practical example, while section 4 briefly presents ESSE. Section 5 shows the generation of a correct evaluation model using this tool. Finally, section 6 concludes the paper and outlines future directions.

Decision Making for Software System Evaluations
Given the importance of software, it is necessary to evaluate software products and related processes in a systematic way. Various criteria must be analysed and assessed to reach the final decision. As already mentioned, software evaluation problems fall into the class of decision making problems that are handled through the MCDA methodology. In order to render this paper self-contained, the basic steps of a systematic MCDA decision process are described in section 2.1.
Finally, section 2.2 discusses the current situation in the software industry regarding decision making and identifies a list of typical wrong steps, frequently taken in current decision making for software evaluation.

Steps of a Typical Decision Process
An evaluation problem solved by MCDA can be modeled as a 7-tuple {A, T, D, M, E, G, R} where [20]:
- A is the set of alternatives under evaluation in the model
- T is the type of the evaluation
- D is the tree of the evaluation attributes
- M is the set of associated measures
- E is the set of scales associated to the attributes
- G is the set of criteria constructed in order to represent the user's preferences
- R is the preference aggregation procedure
In order to solve an evaluation problem, a specific procedure must be followed [12]:
Step 1: Definition of the evaluation set A: The first step is to define exactly the set of possible choices. Usually there is a set A of alternatives to be evaluated and the best must be selected. The definition of A could be thought of as a first-level evaluation, because if some alternatives do not fulfill certain requirements, they may be rejected from this set.
Step 2: Definition of the type T of the evaluation: In this step we must define the type of the desired result. Some possible choices are the following:
- choice: partition the set of possible choices into a sub-set of best choices and a sub-set of not best ones.
- classification: partition the set of possible choices into a number of sub-sets, each one having a characterization such as good, bad, etc.
- sorting: rank the set of possible choices from the best choice to the worst one.
- description: provide a formal description of each choice, without any ranking.
Step 3: Definition of the tree of evaluation attributes D: In this step the attributes that will be taken into account during the evaluation, and their hierarchy, must be defined. Attributes that can be analyzed into sub-attributes are called compound attributes. Sub-attributes can also consist of sub-sub-attributes and so on. The attributes that cannot be divided further are called basic attributes. An example of such an attribute hierarchy is shown in figure 1.
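As an illustration of the distinction between compound and basic attributes, the attribute tree D can be sketched as a small recursive data structure. The class name `Attribute` and the sample hierarchy are assumptions made for this example only; they do not reproduce figure 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    """Node of the evaluation attribute tree D."""
    name: str
    children: List["Attribute"] = field(default_factory=list)

    @property
    def is_basic(self) -> bool:
        # A basic attribute cannot be divided further.
        return not self.children

    def basic_attributes(self) -> List["Attribute"]:
        """Return the basic attributes (leaves) below this attribute."""
        if self.is_basic:
            return [self]
        leaves: List["Attribute"] = []
        for child in self.children:
            leaves.extend(child.basic_attributes())
        return leaves


# Hypothetical fragment of a quality hierarchy; names are illustrative only.
quality = Attribute("Quality", [
    Attribute("Usability", [Attribute("Learnability"), Attribute("Operability")]),
    Attribute("Portability"),
])
leaf_names = [a.name for a in quality.basic_attributes()]
```

Only the basic attributes returned by `basic_attributes()` would receive measures directly; compound attributes obtain their evaluation through aggregation.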
It should be noted that there exist mandatory independence conditions, such as the separability condition, and contingent independence conditions, depending on the aggregation procedure adopted (see [18]).
[…] numbers, while the second type are verbal characterizations, such as "red", "yellow", "good", "bad", "big", "small", etc.
A problem with the definition of M_d is that d may not be measurable, because its measurement is impractical or impossible. In such cases an arbitrary value may be given, based upon expert judgment, introducing a subjectivity factor. Alternatively, d may be decomposed into a set of sub-attributes d_1, d_2, …, d_n, which are measurable. In this case the expression of arbitrary judgment is avoided, but subjectivity is involved in the decomposition.
Step 6: Definition of the set of Preference Structure Rules G: For each attribute and for the measures attached to it, a rule or a set of rules has to be defined that transforms measures into preference structures. A preference structure compares two distinct alternatives (e.g. two software products) on the basis of a specific attribute. Basic preferences can be combined, using some aggregation method, to produce a global preference structure.
For example, let a_1 and a_2 be two alternatives and let d be a basic attribute. Let also m_d(a_1) be the value of a_1 concerning d and let m_d(a_2) be the value of a_2 concerning d. Suppose that d is measurable and of positive integer type. In such a case, a preference structure rule could be the following:
• product a_1 is better than a_2 on the basis of d, if m_d(a_1) is greater than m_d(a_2) plus K, where K is a positive integer
• products a_1 and a_2 are equal on the basis of d, if the absolute difference between m_d(a_1) and m_d(a_2) is equal to or less than K, where K is a positive integer
Step 7: Selection of the appropriate aggregation method R: An aggregation method is an algorithm, capable of transforming the set of preference relations into a prescription for the evaluator. A prescription is usually an order on A.
The MCDA methodology consists of a set of different aggregation methods, which fall into three classes. These are the multiple attribute utility methods [8], the outranking methods [20] and the interactive methods [19] (although it originates from different methodological concerns, we may consider AHP a multi-attribute utility method, since the priorities computed by this procedure define a value function). The selection of an aggregation method depends on the following parameters [20]:
• The type of the problem
• The type of the set of possible choices (continuous or discrete)
• The type of measurement scales
• The kind of importance parameters (weights) associated to the attributes
• The type of dependency among the attributes (i.e. isolability, preferential independence)
• The kind of uncertainty present (if any)
Notice that the steps mentioned above need not be executed strictly in sequence. For example, it is allowed to define first D and then, or in parallel, define A, or even select R in the middle of the process.
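The threshold rule given as an example in Step 6 can be sketched as a small function. This is only an illustration of that single rule: the function name `compare` and its return labels are hypothetical, and real preference structure rules may take other forms.

```python
def compare(m_a1: int, m_a2: int, k: int) -> str:
    """Apply the threshold rule to the measures of two alternatives
    on the same basic attribute d.

    Returns "a1" if a_1 is better, "a2" if a_2 is better and "equal"
    if the absolute difference does not exceed the threshold K.
    """
    if k <= 0:
        raise ValueError("K must be a positive integer")
    if m_a1 > m_a2 + k:
        return "a1"
    if m_a2 > m_a1 + k:
        return "a2"
    return "equal"  # here |m_a1 - m_a2| <= k
```

For instance, with K = 3 the measures 8 and 5 are still considered equal, while 10 and 5 produce a strict preference for the first alternative.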

Current Practice and Common Problems in Decision Making for Software Evaluation
Although the need for a sound methodology is widely recognised, the evaluators generally avoid the use of MCDA methods. In [5] it was observed that only two MCDA methods were used by the decision makers, while [10] reported that even unsophisticated MCDA tools were avoided.
Nevertheless, a lot of research has been dedicated to the problem of selecting the best method for software evaluation. During controlled experiments, it was found that, generally, different methods gave similar results when applied to the same problem by the researchers themselves. A study in [11] assessed empirically that this also held for evaluations made by independent decision makers. Yet, the difficulty of application differs among the MCDA techniques and, in certain cases, a particular MCDA method is recommended. As an example, in [18] the use of the Multiple Attribute Utility Technique (MAUT) is suggested, while in [23] the AHP method has been adopted.
In many cases no systematic decision making method is used at all. Evaluations are based purely on expert subjective judgement. While this approach may produce reliable results, it is more difficult to apply (because experts are not easily found and formal procedures are hard to define) and has a number of pitfalls, like:
− inability to understand completely and reproduce the evaluation results,
− poor explanation of the decision process and the associated reasoning,
− important problem details for the evaluation may be missed,
− high probability that different experts will produce different results without the ability to decide which one is correct,
− difficulty in exploiting past evaluations,
− risk of producing meaningless results.
The number of top-level attributes used in the evaluation model plays a significant role in software evaluation problems. It may seem reasonable to presume that the higher the number of attributes used, the better the evaluation will be. However, our experience in applying MCDA has demonstrated that at most seven or eight top-level attributes should be employed. The reason is that this number of attributes is sufficient to express the most important dimensions of the problem. The evaluator must focus on a relatively limited set of attributes and must avoid wasting time on analysis of low importance. If necessary, broader attributes may be defined, incorporating the excess attributes of lower importance. The latter may be accommodated in lower levels of the hierarchy tree of figure 1. Another weak choice is to include in the model attributes for which the alternative solutions in the choice set A cannot be differentiated. The result is a waste of evaluation resources without adding any value to the decision process. Finally, custom evaluation attribute sets (e.g. [14]) are rarely used in practice, meaning that packaged evaluation experience is not adequately exploited.
Another point concerning attributes is redundancy. Hasty definition and poor understanding of the evaluation attributes may lead to this common error. As an example, consider the quality attribute hierarchy described in [7], part of which is shown in figure 1. Suppose that the evaluator decides to utilise this schema in a software evaluation problem, in which, probably along with other attributes, he is considering Cost and Quality. Some of the quality sub-attributes are related to the attribute Cost. These are Usability, Maintainability, and partially, Portability, which are defined in [7] in terms of "...effort to understand, learn, operate, analyse, modify, adapt, install, etc.". If Cost is defined to comprise maintenance, installation, training, etc. costs, then the combined cost and quality attribute structure will force the evaluator to count the above mentioned cost components twice. Although less critical, redundancy may also manifest in the form of metrics that are strongly related, e.g. "elapsed time from user query to initiation of system response" and "elapsed time from user query until completion of system response".
Another common source of errors is the use of arbitrary judgements instead of structural measurements. In general, arithmetic scales should be preferred to ordinal scales whenever possible, because of the subjectivity involved with the latter. On the other hand, there is already a large volume of literature on the use of structural metrics, such as the complexity and size metrics referred to above, to compute or predict various software attributes. Although there is a lot of debate about their practical usefulness and measurements may require dedicated tools, these metrics should be assessed and used in software evaluations. However, in practice, most evaluators prefer the use of ordinal values, relying purely on subjective assessments and avoiding the overhead of software measurements. This may not necessarily lead to wrong results, provided that appropriate ordinal aggregation procedures are used. However, as mentioned above, software engineers are rarely aware of such techniques. The outcome is that the evaluation results are produced in obscure ways and the decision makers and their management are rarely convinced of their correctness.
The most critical point in evaluation problems is the selection of the appropriate aggregation method, according to the specific problem. In practice, it seems that the most widely known method in the software community is the Weighted Sum, and consequently, whenever a systematic approach is taken, Weighted Sum or Weighted Average Sum is the most frequent choice. This is probably caused by the fact that Weighted Sum has been also used in various software engineering disciplines, such as Function Point Analysis [1].
As mentioned in step 7 of 2.1, the selection of a suitable aggregation method depends on a number of parameters, such as the type of attributes and the type of the measurement methods. In practice, frequently an inappropriate aggregation method is chosen and its application is subject to errors. For example, WAS is applied with attributes for which, although initially an ordinal scale is assigned, an arithmetic scale is forced to produce values suitable for WAS application (good is made "equal" to 1, average to 2, …). As another example, an outranking method could be selected while trade-offs among the criteria are requested. In fact our criticism focuses on the unjustified choice of a procedure without verifying its applicability. For instance AHP requires that all comparisons be done in ratio scales. This is not always the case. If we apply such a procedure on ordinal information we obtain meaningless results. A more mathematical treatment of the above issues and a detailed analysis on aggregation problems and drawbacks can be found in [3] (see also [4,13]).
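The danger of forcing an arithmetic scale onto ordinal ratings can be shown with a small sketch. Two numeric encodings that both preserve the order bad < average < good are applied to the same ordinal ratings, and the weighted sum selects a different winner in each case. The alternatives X and Y, the weights and both encodings are hypothetical, and higher numbers are taken to mean better ratings in this example.

```python
def weighted_sum(ratings, weights, encoding):
    """Weighted sum of ordinal ratings after mapping them to numbers."""
    return sum(w * encoding[r] for r, w in zip(ratings, weights))


weights = [1, 1]  # two attributes of equal importance
ratings = {
    "X": ["good", "bad"],
    "Y": ["average", "average"],
}

# Both encodings respect the order bad < average < good.
encoding_a = {"bad": 1, "average": 2, "good": 4}
encoding_b = {"bad": 1, "average": 3, "good": 4}


def winner(encoding):
    """Return the alternative with the highest weighted sum."""
    scores = {alt: weighted_sum(r, weights, encoding) for alt, r in ratings.items()}
    return max(scores, key=scores.get)


winner_a = winner(encoding_a)  # X scores 5, Y scores 4
winner_b = winner(encoding_b)  # X scores 5, Y scores 6
```

Since the ordinal information alone does not justify either encoding, the resulting ranking depends on an arbitrary choice rather than on the evaluators' judgement.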
Finally, another issue in the software evaluation process is the management of the decision process. Typically, the effort and resources needed for an evaluation are underestimated. A software evaluation is a difficult task, requiring significant human effort. Systematic evaluation necessitates thorough execution of the seven process steps described in 2.1, probably with computer support for the application of an MCDA method or the measurement of certain software attributes. Software evaluations are normally performed with limited staff, under a tight schedule and without a clear understanding of the decision process and its implications. In general, managers seem to be mainly interested in the generation of a final result, i.e. they are eager to reach the end of the decision process, but they seem somewhat careless about the details, the data accuracy and the overall validity of the approach taken.

A Real World Example: a Weak Evaluation Model
Recently a large transport organisation was faced with a rather typical evaluation problem: the evaluation of certain proposals for the evolution of its information system. System specifications were prepared by a team of engineers, who also defined the evaluation model to be used (denoted by M in the following). A number of distinct offers were submitted by various vendors. A second team (the evaluators) was also formed within the organisation. They applied the imposed model, but the results they obtained contradicted common sense.

Description of the Evaluation Problem
The new information system was intended to replace the legacy information system of the acquiring organisation. The old system was of the mainframe type, but the new one should support the development of client-server applications, while maintaining the existing functionality.
The main hardware components of the system were the central server machine, certain network components and a predefined number of client terminals and peripherals (various models of printers, scanners, tape streamers). In all cases there were various requests concerning the accompanying software, e.g. there were specific requests about the server's operating system, the communications software, etc. Additionally, some software components were also required: an RDBMS system, a document management system (including the necessary hardware), a specific package for electronics design and a number of programming language tools.
An additional requirement was the porting of the existing application software, that used to run on the legacy information system, to the new system. No development of new functionality was foreseen.
Finally, certain typical requirements were also included, regarding training, maintenance, company profile (company experience, company size, company core business, ability to manage similar projects), etc.
Five offers (i.e. technical proposals, denoted by P_i, where i = 1, 2, 3, 4, 5) were submitted as a response to the request for tenders, each consisting of a technical and a financial part. Only the technical part of the proposals was subject to the analysis examined in this paper. Of course, the final choice was decided taking into account both aspects.

The Initial Evaluation Model
The initial evaluation model consisted of 39 basic attributes, with various weights having been assigned to them. The proposed aggregation method was the Weighted Sum (WS). A full description of this model is shown in Appendix A.
Although, at first glance, the attributes seem to be organised in a hierarchy (e.g. NETWORK seems to be composed of Network design, Active elements, Requirements coverage), they are actually structured in a flat manner, due to the requirement for WS. This means that each top level attribute (denoted by bold letters in Table 1) has an inherent weight, which equals the sum of the weights of its sub-attributes. This is depicted in Table 1.
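The notion of an inherent weight can be illustrated with a short sketch. The sub-attribute names under NETWORK follow the text, but all weights and the TRAINING entry are hypothetical values chosen for the example; they are not the figures of Table 1.

```python
# Hypothetical flat model: each top level attribute maps to the weights
# of its sub-attributes (the values are illustrative assumptions).
flat_model = {
    "NETWORK": {
        "Network design": 3,
        "Active elements": 2,
        "Requirements coverage": 1,
    },
    "TRAINING": {
        "Training": 2,
    },
}

# The inherent weight of a top level attribute is the sum of the
# weights of its sub-attributes.
inherent_weights = {
    top: sum(subs.values()) for top, subs in flat_model.items()
}
```

In this sketch NETWORK ends up three times as important as TRAINING largely because it has been decomposed into more sub-attributes, which shows how a flat weighted model ties importance to the degree of decomposition.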
According to the model, an arithmetic value must be assigned to each attribute. Two types of arithmetic scales are permitted, ranging from 40 to 70 and from 50 to 70 respectively, as shown in Appendix A. The weights were assigned in M with the intention to reflect the opinion of the people that assembled the model about the relative importance of the attributes. The problems related to M are the following:
1. […] This generates the risk of assigning low importance to an attribute, simply because it cannot be easily decomposed. Another weak point is the inclusion in M of attributes for which, although a significant importance had been assigned, it was inherently hard to propose more than one technical solution, such as Programming Languages 1 and 2, with weight 3.
2. There is some redundancy between certain attributes. For example, it is difficult to think that the selection of the network active elements and the network wiring are completely independent.
3. Although the weights are supposed to reflect the personal opinion of the requirement engineers, […]
4. The specified measurement method generates side effects, which distort the evaluation result. Consider for example the effect of the lower limit of the two arithmetic scales. Suppose that P_i and P_k offer poor technical solutions for two attributes C_1 and C_2 respectively and that the evaluator decides to assign the lowest possible rating, while still keeping the two proposals alive. Suppose also that C_1 and C_2 have been given an equal weight, but scale 40-70 is allowed for C_1 and 50-70 is allowed for C_2. Therefore, P_i is given a 40 for C_1, while P_k is given a 50 for C_2. This means that, although the evaluator attempts to assign a virtual zero value to both proposals, the net result is that P_k gains ten evaluation points more than P_i. If the weight of the two attributes is 6, this translates to a difference of sixty points in favour of P_k, a difference that could even result in the selection of this proposal, in case the final scores of the proposals are very close to each other.
Other similar cases, supporting the evidence that current practice in decision support for software is very error-prone, have been observed by the authors.
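The arithmetic behind the lower-limit effect of the two scales can be reproduced in a few lines. The scale bounds and the weight of 6 come from the text; the `rescale` helper is an illustrative assumption that shows one possible way of removing the offset and is not part of model M or of the remedy adopted by the authors.

```python
WEIGHT = 6             # common weight of attributes C_1 and C_2
SCALE_C1 = (40, 70)    # scale allowed for C_1
SCALE_C2 = (50, 70)    # scale allowed for C_2

# Both proposals receive the lowest rating their scale permits.
contribution_pi = WEIGHT * SCALE_C1[0]   # P_i on C_1
contribution_pk = WEIGHT * SCALE_C2[0]   # P_k on C_2
gap = contribution_pk - contribution_pi


def rescale(value, scale):
    """Map a rating onto [0, 1] relative to its own scale
    (hypothetical helper, shown for comparison only)."""
    low, high = scale
    return (value - low) / (high - low)


# After rescaling, the two "virtual zero" ratings contribute equally.
rescaled_pi = WEIGHT * rescale(SCALE_C1[0], SCALE_C1)
rescaled_pk = WEIGHT * rescale(SCALE_C2[0], SCALE_C2)
```

The sixty-point gap arises purely from the different lower bounds of the scales, not from any difference in the evaluator's judgement of the two proposals.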

The Expert System for Software Evaluation (ESSE)
ESSE (Expert System for Software Evaluation, [21]) is an intelligent system that supports various types of evaluations where software is involved. This is accomplished through the:
• automation of the evaluation process,
• suggestion of an evaluation model, according to the type of the problem,
• support of the selection of the appropriate MCDA method, depending on the available information,
• assistance provided by expert modules (the Expert Assistants), which help the evaluator in assigning values to the basic attributes of the software evaluation model,
• management of past evaluation results, in order to be exploited in new evaluation problems,
• consistency check of the evaluation model and detection of possible critical points.
A detailed description of ESSE is given in [21].

Using ESSE to improve the Evaluation Model
In this section we present the application of two evaluation models to the problem described in the third section. The first model is the initial model discussed in 3.2, while the second model is a new one obtained with the assistance of ESSE. The two models and the results they produced are presented in sections 5.1 and 5.2 respectively.

Application of the initial evaluation model
Initially (i.e. before consultation with the authors) the evaluators examined the technical contents of the five proposals. From this informal analysis it was quite evident that the second proposal (P2) was the best from the technical point of view. Subsequently, the evaluators assigned values to all basic attributes for the various alternatives and obtained the results of Table 2 (the entire model M is presented in Appendix A; the assigned values are not shown for simplicity). The results shown in Table 2 contradicted the initial informal judgement: proposal P3 was ranked first, while P2 was ranked second. Besides, the differences between the scores obtained for each proposal were quite small, giving the impression that the five proposals were very similar from the technical point of view. This may be true for the system software and client terminals, but not for other attributes. The results were not surprising given the criticism presented above against M and the expected distortion of the evaluators' opinion.
Table 2: Results of the application of the initial evaluation model M, using Weighted Sum

Generation and application of an improved evaluation model using ESSE
The problem we are dealing with is characterised by ESSE as "Information Systems Evaluation" and is a sub-class of the more generic class of the "Commercial Product Evaluation" problems. The knowledge base of ESSE already contained past evaluation problems of the same type. The evaluators merely asked ESSE to propose an attribute hierarchy, together with the corresponding weights and scales, for this type of problem. The model proposed by ESSE consisted of a hierarchical attribute structure, with 8 top level attributes, decomposed into a number of sub-attributes. Additionally, certain model items concerning application software were expanded with predefined quality attribute structures, which were refined versions of the quality scheme of [7].
The evaluators modified the proposed model, according to the special characteristics of the problem imposed by the initial model M. Essentially, they removed some attributes they considered irrelevant and added some others, which they considered important.
The evaluators accepted the above analysis, except for the sub-attribute 'backup system', which they considered irrelevant and removed. Moreover, they accepted the proposed scales but gave a weight of 4 to the sub-attribute '24 hours service', because they considered it more important.
The complete improved model (denoted by M') is shown in Appendix B, while a top level view may be found in Table 4.
At each entry of Table 5, the numbers indicate the top level attributes for which a relation S_j(x,y) holds. For example, the entry in row P5 and column P3 with values 4, 5 and 6 indicates that the relations S_4(P5,P3), S_5(P5,P3) and S_6(P5,P3) hold, i.e. P5 is at least as good as P3 with respect to the attributes 4 ('System Software'), 5 ('Application Software') and 6 ('Maintenance').
Having computed the S_j(x,y) relations for the top level attributes and taking into account their weights, we can compute general S relations with respect to all the top level attributes simultaneously. These are the following:
[…]
This ordering was in accordance with the evaluators' intuition, ranking proposal P2 first and proposal P3 second. We also observe that there are differences in the lower positions of the ordering with respect to the results obtained by the initial model M, while there is also indifference between P1 and P4.

Conclusions -Future Research
In this paper we have discussed common errors made during the evaluation of software systems. These errors are caused by insufficient understanding of certain fundamental principles of decision support methodologies and by wrong practical application of evaluation techniques in the software industry. The manifestation of these problems has been exemplified through a real world example: a tender evaluation concerning the acquisition of a new information system for a large organisation. The pitfalls of a specific evaluation model have been pinpointed and it has been shown how an improved model has been generated with the use of an intelligent system, which successfully exploits packaged knowledge about software problem solving.
We plan to examine more evaluation situations in order both to enhance the knowledge bases of ESSE and to obtain more feedback on the benefits of the adoption of a knowledge based solution combined with a sound decision support methodology.
Moreover, we plan to study and implement evaluation knowledge maintenance within ESSE. Currently, the system proposes an evaluation model obtained by a specific past evaluation problem or defined by an expert. Our intention is to derive an evaluation model from various past evaluation problems of the same type, in order to exploit the expansion of the knowledge base of the system.
Finally, more MCDA methods will be implemented within ESSE, along with the necessary knowledge for their correct application.
Once the global outranking relation is obtained, we can deduce:
- a strict preference relation p(x,y) between x and y: p(x,y) ⇔ S(x,y) ∧ ¬S(y,x);
- an indifference relation i(x,y) between x and y: i(x,y) ⇔ S(x,y) ∧ S(y,x);
- an incomparability relation r(x,y) between x and y: r(x,y) ⇔ ¬S(x,y) ∧ ¬S(y,x).
The definition of the relation S(x,y) is such that only the property of reflexivity is guaranteed. Therefore, neither completeness nor transitivity is guaranteed, and thus S(x,y) is not an order on set A. In order to obtain an operational prescription, the relation S(x,y) is transformed into a partial or a complete order through an "exploiting procedure", which can be of different nature ([20]). In this specific case a "score procedure" has been adopted. More precisely, consider a binary relation ≥ being a weak order. Then we have:
x ≥ y ⇔ σ(x) ≥ σ(y)
where σ(x) = |{y: S(x,y)}| − |{y: S(y,x)}|, σ(x) being the "score" of x, computed as the difference between the number of alternatives to which x is "at least as good as" and the number of alternatives which are "at least as good as" x.
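The derivation of the preference, indifference and incomparability relations and the score procedure can be sketched as follows. The outranking relation used here is a hypothetical one over three alternatives a, b and c; it is not the relation obtained in the case study, and the function names are assumptions made for the example.

```python
def derive_relations(alternatives, s):
    """Split an outranking relation s (a set of ordered pairs (x, y),
    meaning "x is at least as good as y") into strict preference,
    indifference and incomparability."""
    strict, indifferent, incomparable = set(), set(), set()
    for x in alternatives:
        for y in alternatives:
            if x == y:
                continue
            xy, yx = (x, y) in s, (y, x) in s
            if xy and not yx:
                strict.add((x, y))
            elif xy and yx:
                indifferent.add((x, y))
            elif not xy and not yx:
                incomparable.add((x, y))
    return strict, indifferent, incomparable


def score(x, alternatives, s):
    """Score of x: number of alternatives x outranks minus the number of
    alternatives that outrank x. The reflexive pair (x, x) would add one
    to both counts, so it is skipped without changing the result."""
    outgoing = sum(1 for y in alternatives if y != x and (x, y) in s)
    incoming = sum(1 for y in alternatives if y != x and (y, x) in s)
    return outgoing - incoming


# Hypothetical outranking relation (reflexive pairs omitted).
alternatives = ["a", "b", "c"]
s = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "b")}

strict, indifferent, incomparable = derive_relations(alternatives, s)
scores = {x: score(x, alternatives, s) for x in alternatives}
ranking = sorted(alternatives, key=lambda x: scores[x], reverse=True)
```

Alternatives with equal scores, such as b and c here, are tied in the resulting weak order, which corresponds to the kind of indifference reported between P1 and P4.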