Efficient brain lesion segmentation using multi‐modality tissue‐based feature selection and support vector machines

Support vector machines (SVM) are machine learning techniques that have been used for segmentation and classification of medical images, including segmentation of white matter hyper‐intensities (WMH). Current approaches using SVM for WMH segmentation extract features from the brain and classify these followed by complex post‐processing steps to remove false positives. The method presented in this paper combines advanced pre‐processing, tissue‐based feature selection and SVM classification to obtain efficient and accurate WMH segmentation. Features from 125 patients, generated from up to four MR modalities [T1‐w, T2‐w, proton‐density and fluid attenuated inversion recovery(FLAIR)], differing neighbourhood sizes and the use of multi‐scale features were compared. We found that although using all four modalities gave the best overall classification (average Dice scores of 0.54  ±  0.12, 0.72  ±  0.06 and 0.82  ±  0.06 respectively for small, moderate and severe lesion loads); this was not significantly different (p = 0.50) from using just T1‐w and FLAIR sequences (Dice scores of 0.52  ±  0.13, 0.71  ±  0.08 and 0.81  ±  0.07). Furthermore, there was a negligible difference between using 5 × 5 × 5 and 3 × 3 × 3 features (p = 0.93). Finally, we show that careful consideration of features and pre‐processing techniques not only saves storage space and computation time but also leads to more efficient classification, which outperforms the one based on all features with post‐processing. Copyright © 2013 John Wiley & Sons, Ltd.


INTRODUCTION
White matter hyper-intensities (WMH) are regions in the brain white matter (WM) that appear with bright signal on T2-weighted (T2-w) and fluid attenuated inversion recovery (FLAIR) MR modalities. They are a possible risk factor for Alzheimer's Disease (AD) and vascular dementia, with progression associated with vascular factors and cognitive decline [1]. To quantify these changes in large scale population studies, it is desirable to have fully automatic and accurate segmentation methods to avoid time-consuming, costly and non-reproducible manual segmentations. However, WMH segmentation using a single modality is challenging, because their signal intensity range overlaps with that of normal tissue: in T1-weighted (T1-w) images, WMH have intensities similar to grey matter (GM), and in T2-w and proton-density (PD) images, WMH look similar to cerebro-spinal fluid. FLAIR images have been shown to be most sensitive to WMH [2], but can also present hyper-intensity artefacts that can lead to false positives. To improve the WMH segmentation performance, additional discriminative information is extracted from multiple MR modalities. The most successful lesion segmentation methods in the literature have been developed for the detection of multiple sclerosis lesions, with a recent grand challenge comparing the performance of various techniques [3]. Lesion segmentation algorithms can be categorised into unsupervised clustering or (semi-)supervised voxel-wise classification [4]. Unsupervised methods suffer from the issue of model selection. Supervised methods such as neural networks [5], k-nearest neighbour (kNN) [2], Naive Bayes classifier [6] and Parzen windows [7,8] have been proposed. Neural networks can be efficient, but designing an appropriate network architecture and setting the parameters are difficult.
We present an SVM-based segmentation scheme whose preliminary results were presented as a conference paper in [9], and inspired by the work in [1,10]. Lao et al. applied four steps: preprocessing (co-registration, skull-stripping, intensity normalisation and inhomogeneity correction), SVM training with AdaBoost, segmentation and elimination of false positives. Our implementation utilises a similar but more advanced pre-processing pipeline and a simpler training procedure. As one of the primary causes of errors in other approaches is false positive cortical regions, we use information from multiple modalities to define a mask of potential WMH. This mask is built from patient specific tissue segmentation and atlas-based population tissue priors. It leads to three main advantages compared with existing techniques. First, such careful feature selection enables to have a more accurate model without the use of boosting. Second, it limits the areas where the classification is performed on the training set, which means a faster overall brain classification. Third, it reduces the false positive regions that are usually found with naive classifiers, so the advanced post processing required by other techniques [1] are not necessary. We also evaluated the relative value of each MRI acquisition protocol for WM lesion segmentation. This scheme is quantitatively validated on a significantly larger dataset including healthy ageing, mild cognitive impairment and AD subjects. These results were compared with other supervised classification algorithms such as kNN, Naive Bayes, Parzen windows and decision tree.

CLASSIFICATION AND SUPPORT VECTOR MACHINE THEORY
Lesion segmentation can be formulated as a binary classification problem. The SVM technique [11] solves it in a supervised way: given l labelled features .x i , y i / 2 X f 1, 1g, it builds a function f W X ! R such that y../ D sign.f ..// is an optimal labelling function. The function f is a solution of the optimization problem: where K W X X ! R is a Mercer Kernel, H K its associated Reproducing Kernel Hilbert Space of functions X ! R and its corresponding norm k k K , and V is the hinge loss defined as V .f .x/, y/ D maxf0, 1 y f .x/g. The loss function V controls the labelling performance, and the second term controls the smoothness of the solution.
The optimization problem is convex because of the convexity of the hinge loss function. However, as the objective function is not differentiable, the problem is reformulated with additional slack variables 1 , : : : , l 2 R: subject to: i > V .f .x i /, y i / 8i 2 f1, : : : , lg The Riesz representation theorem states that the solution of (1) exists in H K , and can be written as follows: Therefore, the problem that˛must solve is max 1 ,:::,˛l 2R K.x i , x j / D max 1 ,:::,˛l 2R 2˛T y ˛T KT he training vectors with˛i ¤ 0 are called the support vectors. The optimization maximises the margin, which is the distance between the decision boundary and the support vectors.

Data
The dataset used in this paper comes from the AIBL study [12], where T1-w (160 240 256 image, spacing 1.

Proposed algorithm
The proposed algorithm uses the standard supervised classification design for segmentation: given images and corresponding segmentations, the goal is to build a classifier to segment new images Lesions can be seen in the FLAIR and T2-w as a bright signal.

Input data
Pre-processed dataset

Feature definition
Training set

Machine learning
Classifier Testing set Segmentation in the mask Figure 2. Supervised classification algorithms for segmentation aim at building a classifier from images and corresponding segmentations. To obtain good performance, adequate pre-processing, mask and feature definitions have to be used. This application-specific part is followed by a machine learning process where a classifier is built from training examples, to segment new (testing) images.  ( Figure 2). To obtain good performance, adequate preprocessing, mask and feature type have to be defined. This application-specific part is followed by a machine learning process. As summarised in Figure 3, the proposed algorithm consists of the following steps: Pre-processing: Images were rigidly co-registered [13], bias-field corrected [14], smoothed using anisotropic diffusion and histogram equalised to a reference subject. T1-w images were segmented into WM, GM and cerebro-spinal fluid using an expectation-maximisation (EM) approach with priors [15]. For each modality, features were extracted within the mask defined later, and scaled to [0, 1]. Multi-modality features were created by concatenation of single modality features. Neighbourhood intensities features (3 3 3 and 5 5 5 sizes) and pyramidal features (with four levels, taking one voxel per level, Gaussian kernel convolutions of D f0.5, 1, 1.5g) were examined. Mask creation: A global threshold on FLAIR images provides a high sensitivity, but poor specificity, which means it can only be used to define areas of interest. To further refine the areas of interest, we define the region W as the intersection of the dilated Colin WM mask [16] (which was registered rigidly [13] then non-rigidly [17] to the subject) and the WM mask (from the tissue segmentation in patient space). Using the mean W and standard deviation W of the FLAIR intensities on W , an intensity threshold of W C W on W ( D 2 was the numerical value used) is used to define the mask M ( Figure 4): Machine learning: A subset of 10 000 features, with half belonging to the lesion class, the other half belonging to the non-lesion class, randomly selected and equally distributed among the training samples was used to generate the classifiers. A MATLAB implementation solving SVM in its primal formulation was used [18,19]. The chosen kernel was the (Gaussian) radial basis function. The width of the kernel and the regularisation weight were selected via a 10-fold cross validation. Then, each image in the test set was segmented within the patient mask created. Pixels outside this region were set to the non-lesion class. As post-processing, all the connected components segmented as lesion with less than 10 voxels were removed.

Validation
The dataset was randomly and equally split into training and test sets. A classifier was built using the training set, and then used to segment the test set. Then, training set and test set were swapped, another classifier was built, and the rest of the segmentations were computed. Results were then merged. Model performances were compared using the Dice score TP, TN, sensitivity and specificity. Lower is better for FP and FN. Statistical significance was analysed via the p-values of paired t-tests [21]. We performed experiments to test the influence of the combination of modalities, the influence of the feature type and the influence of using the mask in pre-processing instead of in the post-processing. The performance of the SVM classifier was compared with the performance of other supervised classification algorithms. As the overall lesion load impacts the segmentation performance, as previously reported in [2], results are displayed for low (< 3 mL), moderate (3 10 mL) and severe (> 10 mL) lesion loads. Finally, the impact of the parameter in the mask creation was evaluated by performance bounds of the segmentation performance.

Performance with regard to modality combinations
The segmentation performance was evaluated for various combinations of modalities (using 3 3 3 neighourhood features). Figure 5 shows DSC, FP and TP values for four single-modality and four multi-modality features. TN, FN, sensitivity and specificity values are similar for the various combinations, so corresponding graphs are not displayed. When using one modality, FLAIR gives the best performance. However, combining several modalities reduces FP and increases TP. Table I indicates that on low and moderate lesion loads, the T1-w + FLAIR combination performs statistically better than FLAIR (T2-w + FLAIR does not). On the overall dataset, the T1-w + T2-w + FLAIR combination performs statistically better than FLAIR. The model with the four modalities performs the best ( Figure 5), but not significantly better than T1-w + FLAIR (p D 0.50).

Performance with regard to features type
Using the four modalities, the segmentation performance was evaluated for various feature types ( Figure 6). With neighbourhood intensity features, a 5 5 5 size slightly increased the DSC compared with 3 3 3, but the difference was not statistically significant (p D 0.93). Pyramidal features with four dimensions did not perform as well as neighbourhood intensity features, but the

Performance with regard to the mask use
Using the FLAIR modality and 3 3 3 neighbourhood feature type, the impact of the mask on the segmentation performance has been evaluated. The use of the mask in the pre-processing instead of post-processing significantly decreased FP and led to a much better DSC (Figure 7). The computation time in the prediction step being linear in the number of features to label, computing predictions for a significantly lower number of features (only within the mask) reduced the computation time (41 times computation speed-up on our dataset with D 2).

Performance comparison with other supervised classification algorithms
Using the FLAIR + T1 combination and 3 3 3 feature type, the performance of the SVM classifier was compared with several other supervised classifiers: kNN (with k D 100 as in [2]), Naive Bayes, Parzen window and decision tree.
The kNN classifier categorises a feature x to the class with the highest cardinality among the k nearest neighbours of x in the training set.
The Naive Bayes method computes the posterior probability of a feature x belonging to each class y, and classifies according to the largest posterior probability (Equation (9)). To compute the parameters of the probability density of feature x given class y, the features are assumed conditionally independent given the class.
In the case of the Parzen window, the prior probability is estimated as where K is, for example, the Gaussian kernel. The Parzen window classifier is then found using the Bayes rule: On this dataset, using combined 3 3 3 features from FLAIR and T1 modalities, SVM obtained the best DSC results, followed by Parzen windows, kNN, decision tree and Naive Bayes ( Figure 8). In terms of FP, kNN, Parzen window and SVM provide the best results. These tree classifiers also provide the best specificities. However, kNN and Parzen window classifiers have a quite low sensitivity, whereas SVM is most sensitive (followed by classification tree).

Performance bounds from mask parameter setting
The influence of the parameter in the mask creation on the segmentation performance was evaluated. Three performance bounds were computed, which are independent of the modality combination and feature type.
First, the upper bound of the final FP, which is equal to the number of FP for the 'lesioneverywhere' classifier (worst-case scenario in terms of FP). Figure 9(a) shows the FP upper bound for different values of . The higher is, the smaller is the mask M (which is the area used for the machine learning), and therefore, the lower is the FP upper bound. Second, the lower bound of the final FN, which is obtained when using the optimal classifier within the mask (best case scenario in terms of FN). Figure 9(b) shows the FN lower bound for different values of . The higher is, the smaller is the mask M , the more lesion features will be left out, and therefore, the higher is the FN lower bound. Third, the upper bound of the final Dice score, which is similarly obtained using the optimal classifier within the mask. The higher is, the lower is the Dice upper bound (Figure 9c).
These graphs give insight on the impact of the parameter. If is too high, the final segmentation performance will be low, no matter how good is the classifier segmenting inside M . If is too low, the algorithm has a high risk of FP, which is a known drawback in WM lesion segmentation. Setting therefore involves a trade-off between the use of tissue-information to reduce the risk of FP and a

CONCLUSION
We have presented a machine learning scheme applied to the WMH segmentation problem. Our approach is inspired by the previous work on SVM, but has a number of differences. It combines the use of tissue segmentation, atlas propagation techniques and SVM classification to get efficient and accurate segmentation results. Using our pipeline and our dataset, SVM has a higher classification performance than other supervised algorithms such as kNN, Naive Bayes, Parzen window and decision tree.
In this work, we also quantified the relative performance variations with regard to different modalities or feature types. Regarding the modalities, our results confirm that using all of the four modalities adds discriminative information and improves the segmentation results, as reported in [1]. However, our quantitative results show that using only FLAIR and T1-w can give similar performance at a lower cost. One reason could be the lower axial resolution of our T2-w and PD images. Regarding the features types, there is a trade off between the complexity, storage place and computation time versus the performance.
As other important contribution of this work, the mask we define and use in the pre-processing has several positive impacts. First, it improves the classifier performance as the training features are selected in regions of interest, which leads to better classifiers. We have given insight on the trade-off related to the threshold parameter selection in the mask creation. Increasing the threshold decreases the upper bound of FP (and therefore the potential risk of final FP). However, a low FP upper bound comes at a price as increasing the threshold increases the FN lower bound and decreases the Dice upper bound, which means a threshold too high would cause poor final performance no matter how good the classifier is. Second, computation time and storage space required are significantly lower (41 times lower on our dataset with the chosen threshold) as features and predictions are computed in a restricted area. Finally, using our mask in the pre-processing makes most of the complex post-processing steps required in the current state-of-art methods redundant.