Enhancing spatio-chromatic representation with more-than-three color coding for image description

Extraction of spatio-chromatic features from color images is usually performed independently on each color channel. Usual 3D color spaces, such as RGB, present a high inter-channel correlation for natural images. This correlation can be reduced using color-opponent representations, but the spatial structure of regions with small color differences is not fully captured in two generic Red-Green and Blue-Yellow channels. To overcome these problems, we propose a new color coding that is adapted to the specific content of each image. Our proposal is based on two steps: (a) setting the number of channels to the number of distinctive colors we find in each image (avoiding the problem of channel correlation), and (b) building a channel representation that maximizes contrast differences within each color channel (avoiding the problem of low local contrast). We call this approach more-than-three color coding (MTT) to emphasize that the number of channels is adapted to the image content. The higher the color complexity of an image, the more channels can be used to represent it. Here we select distinctive colors as the most predominant in the image, which we call color pivots, and we build the new color coding using these color pivots as a basis. To evaluate the proposed approach we measure its efficiency in an image categorization task. We show how a generic descriptor improves its performance at the description level when applied on the MTT coding.


Introduction
In color images the values of pixels encode the spectral information of the light reflected by the surfaces in the scene. These values are represented in a k-dimensional color space, and a common formulation of this representation is written as
$$f_k = \int_{\omega} E(\lambda)\, S(\lambda)\, R_k(\lambda)\, d\lambda, \qquad (1)$$
where $f_k$ is the k-th component of the pixel value, $E(\lambda)$ is the illuminant of the scene, $S(\lambda)$ is the surface reflectance we are looking at, $R_k(\lambda)$ is the sensitivity function of the k-th sensor defining an axis of the color space, and $\omega$ is the visible spectrum, usually ranging between 400 and 700 nanometers.
Equation 1 tells us that color in the physical world is mathematically modelled as a point-based phenomenon. However, when we face higher-level visual tasks, such as automatic image classification, building efficient color descriptors requires defining color in its surrounding context. This involves the representation of spatio-chromatic information, which is a difficult problem to deal with. It has been tackled in previous works from different points of view [1,2,3,4]; here we review three main approaches.
The first one, generalized by Weickert [1], is based on considering color differences as partial derivatives computed on each RGB color channel. This was first addressed by Di Zenzo [5], who introduced the idea of a color tensor. It provided a way of combining the channel gradients to obtain the orientation of the color variation in a local spatial neighborhood. Subsequently, this idea was further developed by Kass and Witkin [6] for oriented patterns, and finally established by Weickert [1], who introduced an additional integration scale which increases the color-spatial coherence.
A second approach by Mäenpää and Pietikäinen [2] is based on computing image descriptors in different color spaces and using the best space for each specific application. This idea led van de Sande et al. [3] to study which combinations of color representation and descriptor were the most appropriate for recognition tasks. They considered well-known three-dimensional color spaces such as device-dependent RGB, colorimetric XYZ, perceptually uniform CIELab and CIELuv, cylindrical HSL and HSV, and the physiologically-based opponent space. These spaces were combined with common image descriptors, such as SIFT [7] and GIST [8]. In this direction, Zhang et al. [9] proposed a biologically-inspired descriptor which extends the 3D color space with a fourth opponent channel. Recently, Cernadas et al. [10] searched for the best combination of color spaces, normalization methods and features for texture classification, and González-Rufino et al. [11] studied different colour-texture features to differentiate cells in histological images.
The third approach is based on the extraction of color blobs (i.e. homogeneous color regions) directly from trichromatic representations. In particular, Alvarez and Vanrell [4] describe an image in terms of shape and color attributes of the image blobs. In this case the blobs are obtained from each channel of the opponent space by using Lindeberg's blob detector [12]. Khanina et al. [13,14] adapted the scale-space technique for color images and proposed to use the Hessian matrix. Ming and Ma [15] proposed a weighted multi-scale blob detector using a hybrid operator which combines the Laplacian and the determinant of the Hessian. The results of this operator are later processed by a blob filter that includes a color-based Förstner operator and a hue-based histogram.
In all the above approaches, a variety of descriptors based on local spatial features have been defined over different three-dimensional color representations, mostly on RGB or opponent color spaces. Here, we hypothesize that the performance of these descriptors for high-level visual tasks, such as image classification, can be improved by using color spaces that boost the appearance of the spatio-chromatic image structure. Boosting can be achieved by overcoming two main drawbacks: (a) inter-channel correlation of RGB spaces, and (b) lack of contrast in color-homogeneous regions of opponent spaces. These two effects can be seen in Fig. 1, where important edges between regions of different colors (orange-green edges) present clearer differences in color-opponent spaces than the inter-channel correlated edges in RGB. However, the spatial structure appearing inside homogeneous-color regions is more contrasted in RGB than in opponent channels, where minor details (across the green or orange area) are lost.
To test the previous hypothesis, in this paper we propose a new color representation that achieves decorrelation and enhancement of local color contrast based on the following ideas: (a) using more than three channels if required, i.e. adapting color coding to the content of each specific image; (b) enhancing local contrast inside channels by maximizing the contrast with respect to the most representative color of each channel. Following these ideas, we compute a multi-channel representation of the spatio-chromatic image structure in a two-step process. First, we select the set of distinctive image colors, denoted as pivots, which capture the most relevant colors for each specific image. Second, the value of a pixel in each new channel is computed from the similarity between its trichromatic color and the corresponding pivot of the channel. We name the proposed representation more-than-three color coding, since the number of distinctive colors is not restricted to the usual three (although in some cases it can be three, or even two). In general, the more color diversity the image has, the more color channels our representation has. We will denote our approach as MTT (More-Than-Three) from now on. To test the proposed MTT coding, we use the semi-joint texton descriptor (STD) introduced by Alvarez and Vanrell [4]. This descriptor, based on the Texton theory by Julesz and Bergen [16], decomposes the image into minimal color regions (blobs). These blobs are described in terms of their color and shape attributes, which are not conditioned by the image space. This independence from the space makes this descriptor the most adequate to be directly applied to the new color representation without any additional computation. We report our results on two different experiments.
Firstly, we compare the representation capabilities of MTT and two trichromatic representations, namely RGB and opponent space, concluding that MTT allows a more accurate representation of the image content thanks to its lower correlation and higher local contrast, which yield a more detailed blob-based representation over the full image area. Secondly, we perform an experiment on scene categorization showing that our approach achieves a higher accuracy, outperforming state-of-the-art results computed at the descriptor level.
Although we show a good performance with the proposed approach, two criticisms of our initial hypothesis may arise. The first one refers to the increase in the number of color channels compared to usual representations. However, the use of extra channels can be linked to recent findings about the existence of multiple hue maps in the human visual system [17,18]. These hue maps show selectivity to more colors than the primaries encoded in three-dimensional opponent spaces. The second criticism refers to tuning to each specific image content. This tuning may complicate the description of images for comparison purposes. However, it ensures a better spatio-chromatic representation for image regions that can otherwise be lost with a fixed coding, as will be shown in the experiments.
The rest of the paper is organized as follows. In Section 2 we detail our new representation. In Section 3 we define the experimental setup and present the results obtained by our approach on the experiments. Finally, in Section 4, the conclusions of the paper are discussed.

More-than-three color coding (MTT)
Our goal is to define a color representation which has a channel for each distinctive color in the image. By distinctive colors we mean those that play an important role in understanding the image content. We use as many channels as distinctive colors an image has. For a given channel we assign (i) the maximum value to pixels of the distinctive color, and (ii) to the rest of the pixels, a value inversely proportional to their distance to that distinctive color. In this way, in each channel, we are maximizing the representation of a distinctive color while preserving its spatial coherence. Since all the distinctive colors have their own channel, we ensure that all the important color regions of the image will be fully represented in at least one channel, and that all the region details will be maximally contrasted in its channel. We denote the distinctive color of a channel as its pivot.
Let us note here that the proposed representation is based on the content of each image. Color coding for each image is dependent on the color pivots computed from that particular image. For instance, an image of a forest with four distinctive colors could be represented by a channel for green leaves, a channel for brown tree trunks, another for blue sky, and a last one for white clouds. Meanwhile, an image of a beach could be represented by 3 channels with all the details of yellowish sand on one channel, deep blue of the sea on a second one and light blue of the sky on a third. We want to remark that this representation does not have a fixed dimensionality; it varies from one to any number representing the color complexity of a specific image scene. Nonetheless, we can state that this dimensionality usually converges to a moderate number since natural images are typically dominated by only a few colors [19].
The process to obtain the proposed MTT coding can be divided into two parts: (a) the selection of pivots (Section 2.2.1) and (b) the definition of the channel values (Section 2.2.2). A general scheme of this process is summarized in Fig. 2.
Figure 2: Pipeline of the method. From the original image, we extract a set of ridges corresponding to the most distinctive colors and then we select a color pivot for each ridge. These color pivots are the basis to generate the proposed more-than-three (MTT) color coding of the image, which results in as many channels as distinctive colors the image has.

Selecting color pivots
As we have introduced before, color pivots must be the most distinctive colors of the image. In this work we propose to interpret that distinctive colors are the most predominant ones in the image and we find them using the Ridge-based Analysis of Distributions (RAD) technique [20]. The RAD algorithm groups image colors according to the ridges of the histogram. Ridges are computed by extracting all the local maxima of the histogram and connecting those which are close to each other.
Although other existing approaches could be used instead, we selected RAD because it has been proved to fulfill two properties that are of clear interest to our method. First, the RAD algorithm is invariant to some color distortions as ridges extract all the histogram maxima plus all their nearby similar values, therefore being robust to small changes like the ones caused by noise. Second, all the points in a ridge are connected, which means that the ridge representation is robust to shadows and highlights, since both shadow and non-shadow regions of an object are included in the same ridge. Thus, small color distortions will not affect our method, since they will be captured by the ridge algorithm obtaining always a single color pivot for each dominant color. These effects are not captured by classical clustering methods (e.g. k-means) which group colors mainly based on colour similarity, while RAD allows joining colors from different parts of the histogram in the same ridge, if there is a sequence of local maxima that can be connected. In the next lines we briefly summarize how predominant colors are extracted with this method.
Let us define an image I as an M × d matrix where M represents the number of pixels in the image and d is the dimension of the color space (RGB, Lab, etc.). In RAD, the first step is to look for local maxima on the color histogram H(I) with the multilocal creaseness measure κ(·) of Lopez et al. [21,22]. This measure is defined in terms of the following quantities: x is a bin of the histogram H(I), $x_k$ is the k-th neighbor of x on an r-connected neighborhood, $\omega(x_k)$ and $n(x_k)$ are the dominant gradient orientation and the unit normal vector to the discrete boundary of the neighborhood at each boundary site $x_k$, respectively, and d is the dimension of the histogram space. All the mathematical details can be found in [22]. In our implementation we use the RGB color space (i.e. d = 3) quantized in 30 × 30 × 30 equally spaced bins. We use r = 6 to consider a 6-connected neighborhood as in the original implementation of RAD [20]. The local maxima of κ(·) which are close in the histogram are connected by following the lines of shallowest gradient descent until a flat region is reached. The sets of points contained in each of these lines are called ridges of the histogram; each ridge is a set of color values from the image. For a particular image, the set of all the ridges extracted applying RAD will be denoted by $\{C_i^I\}_{i=1:L}$ and they will represent the most predominant colors of the image I.
Let us now focus on searching for the color pivots. For a particular ridge $C_i^I$ of an image I, its color pivot, $\rho_i^I$, is defined as the one that fulfills
$$\rho_i^I = \arg\max_{c \in C_i^I} H(c), \qquad (4)$$
that is, $\rho_i^I$ is the color value of ridge $C_i^I$ that has maximum value in the image histogram H(·).
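The pivot rule described above (take, within each ridge, the color with the highest histogram count) can be sketched as follows. This is only an illustration under stated assumptions: the ridges are taken as given lists of histogram bins, the RAD ridge extraction itself is not implemented, pixels are assumed to be 8-bit RGB triples, and the names `quantize`, `color_histogram` and `pivot_of_ridge` are hypothetical.

```python
from collections import Counter

BINS = 30  # bins per RGB axis, as stated in the text


def quantize(pixel, bins=BINS):
    """Map an 8-bit RGB triple to its histogram bin (one index per axis)."""
    return tuple(min(v * bins // 256, bins - 1) for v in pixel)


def color_histogram(pixels, bins=BINS):
    """Count how many pixels fall in each occupied bin of the RGB cube."""
    return Counter(quantize(p, bins) for p in pixels)


def pivot_of_ridge(hist, ridge):
    """Return the bin of the ridge with the highest histogram count.

    `ridge` is an iterable of bins; producing it is the job of RAD,
    which this sketch does not implement.
    """
    return max(ridge, key=lambda b: hist[b])
```

Here the pivot is returned as a bin index; mapping it back to a color value (for example the bin center) is left open, since the text does not fix that detail.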

Pivot-based encoding
After selecting the set of color pivots $\{\rho_i^I\}_{i=1:L}$ of image I, we define the new spatio-chromatic representation as the M × L matrix obtained using the similarity metric given by
$$MTT^I_{j,i} = \max_{k}\left(\|I_{k,\cdot} - \rho_i^I\|_m\right) - \|I_{j,\cdot} - \rho_i^I\|_m, \qquad (5)$$
where $I_{j,\cdot}$ represents the vector consisting of the 3 color components of pixel j from the original image and $\|\cdot\|_m$ represents the m-Minkowski norm. In this work we have used m = 2, which is equivalent to the Euclidean distance, although other distances, such as the perceptual CIEDE2000 [23], could also be used.
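A minimal sketch of the pivot-based encoding is given below. It assumes that each channel value is the largest pixel-to-pivot distance in the image minus the distance of the current pixel to that pivot, which matches the properties stated in the text (maximum value at the pivot color, zero at the farthest color). The function name `mtt_encode` and the list-of-tuples image layout are illustrative choices, not the authors' implementation.

```python
import math


def mtt_encode(pixels, pivots):
    """Return an M x L list of lists: one row per pixel, one column per pivot."""
    # Euclidean distance (m = 2) from every pixel to every pivot.
    dist = [[math.dist(p, rho) for rho in pivots] for p in pixels]
    # Largest distance to each pivot over the whole image.
    max_dist = [max(row[i] for row in dist) for i in range(len(pivots))]
    # Similarity: high near the pivot color, zero at the farthest color.
    return [[max_dist[i] - row[i] for i in range(len(pivots))] for row in dist]
```

For example, with pixels `(0, 0, 0)`, `(3, 4, 0)`, `(6, 8, 0)` and pivots `(0, 0, 0)` and `(6, 8, 0)`, the first channel decreases from 10 to 0 along the pixel list and the second one increases from 0 to 10.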
The computational complexity of our approach is linear in the number of pixels of the image for a fixed number of bins and a given dimension of the histogram space (in our case, 30 × 30 × 30 and 3 respectively). Computing the MTT representation for an image of 768 × 768 pixels takes on average 888 ms, of which 722 ms correspond to the pivot selection (including the time of the RAD method) and 166 ms to the pivot-based encoding step. These computations were done on an Intel Xeon CPU E5-1620 processor.
In Fig. 3 we present the MTT representations of a set of images, and we compare them to the RGB and the opponent representations. We can see that each MTT channel enhances different parts of the image. For example, in the first row, MTT channels emphasize different parts of the postbox. The base and the aperture are represented in the black channel, the box is in the red channel, the notice plate and the background trees are mainly enhanced on the gray channel, the grass is represented on the green channel, and the sky appears in the light-gray channel. We can appreciate how color information is less correlated on these channels than in the RGB channels (please, focus on the green and blue channels of RGB) and that opponent channels present less contrast between the different objects of the image. Similarly, in the second row, the different parts of the boy's clothes (in red, blue, and orange channels), the snowman (in the white channel), and the background (in the gray channel) are all enhanced in different channels. An analogous analysis can be performed in the rest of images.
Notice that since our MTT representation is content-based we obtain a different number of channels on each image depending on the variety of colors in it. In the examples, the first two images have five channels whereas the last one showing a purple flower on a green background has only two channels. Notice also that the MTT channels represent different colors for each image. In some cases, as in the first-row image, two shades of the same color can be represented in different channels if they are sufficiently different from each other (in this example, gray and light gray).
Finally, let us explain how we can derive an inverse transform to the original space. By the construction of our space we know that for each channel: i) the color selected as a pivot is always a trichromatic value appearing in the image. Therefore, the maximum value of the channel is equal to the maximum difference between the color of the pivot and the color of a certain pixel in the image; and ii) there exists a pixel in the image (the one whose color is at the furthest distance from the pivot) whose representation in the channel is 0. Mathematically,
$$\max_{j} MTT^I_{j,i} = \max_{k} \|I_{k,\cdot} - \rho_i\|_m, \qquad (6)$$
$$\min_{j} MTT^I_{j,i} = 0. \qquad (7)$$
These two properties allow us to invert Eq. 5 as follows:
$$\|I_{j,\cdot} - \rho_i\|_m = \max_{k} MTT^I_{k,i} - MTT^I_{j,i}. \qquad (8)$$
Then, given $MTT^I_{j,i}$ and $\rho_i$, this last equation defines a surface (a sphere if m = 2) of possible values for each $I_{j,\cdot}$. Therefore, to recover the original image we just need to know the value of three of the pivots that are linearly independent, and use trilateration. Then, our recovered image will be given by the values $I_{j,\cdot}$ that fulfill Eq. 8 for three values of i.
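The inversion can be illustrated with the sketch below, which first turns channel values back into pixel-to-pivot distances and then intersects the resulting spheres. As a simplifying assumption that differs from the text, it uses four pivots that are not coplanar: subtracting the sphere equations pairwise then gives a linear 3 × 3 system with a single solution, whereas with only three spheres the intersection generally has to be disambiguated between two candidate points. The names `radii_from_mtt`, `det3` and `recover_pixel` are hypothetical, and the sketch assumes m = 2 and that every pivot color occurs in the image.

```python
def radii_from_mtt(mtt):
    """Turn MTT channel values back into pixel-to-pivot distances."""
    channel_max = [max(row[i] for row in mtt) for i in range(len(mtt[0]))]
    return [[channel_max[i] - row[i] for i in range(len(row))] for row in mtt]


def det3(a):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def recover_pixel(radii, pivots):
    """Recover one color from its distances to four non-coplanar pivots."""
    p0, r0 = pivots[0], radii[0]
    n0 = sum(c * c for c in p0)
    A, b = [], []
    for p, r in zip(pivots[1:4], radii[1:4]):
        # ||x - p0||^2 = r0^2 minus ||x - p||^2 = r^2 is linear in x:
        # 2 (p - p0) . x = r0^2 - r^2 + ||p||^2 - ||p0||^2
        A.append([2.0 * (pc - p0c) for pc, p0c in zip(p, p0)])
        b.append(r0 * r0 - r * r + sum(c * c for c in p) - n0)
    d = det3(A)
    if d == 0:
        raise ValueError("pivots are coplanar; the linear system is singular")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column of A by b.
        Ac = [row[:] for row in A]
        for i in range(3):
            Ac[i][col] = b[i]
        solution.append(det3(Ac) / d)
    return tuple(solution)
```

A least-squares solve over all available pivots would be the natural extension when more than four channels are present.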

Illumination invariance
As explained in the introduction, pixel values of an image depend on the reflectance of the objects, the camera sensors, and the illumination of the scene. Therefore, when the illumination of the scene changes (which is usual in real images), pixel values also change, thus hindering the performance of computer vision algorithms. Different methods have been proposed to counteract the illuminant variability, either by discounting the illuminant [24] or by performing some form of color normalization [25]. In this section, we show that our image representation can be directly used as an invariant to the illumination (therefore avoiding the need of further processing) when computed on the logRGB color space. In RGB space, the change in illumination between two images of the same scene can be approximately modeled by a single scaling factor on each channel (i.e. the Von Kries coefficient law [26]), either directly [27] or by applying the spectral sharpening technique [28,29]. That is, given an image $I_1$, an image $I_2$ of the same scene under a different illuminant can be defined as
$$I_2 = I_1 D^{1,2}, \qquad (9)$$
where $D^{1,2}$ is a 3 × 3 diagonal matrix containing the scaling factors for each RGB channel, therefore transforming the colors under the first illuminant to those under the second illuminant. If we apply a logarithm operation to the RGB space, the previous equation can be rewritten as
$$\log(I_2) = [d^{1,2}, \cdots, d^{1,2}]^T + \log(I_1), \qquad (10)$$
where $d^{1,2} = [\log(D^{1,2}_{11}), \log(D^{1,2}_{22}), \log(D^{1,2}_{33})]^T$. This is the case since $D^{1,2}$ is a diagonal matrix and thus the channels of $I_1$ are treated independently. Equation 10 tells us that an illumination change can be modeled by a translation in logRGB space. Therefore, for any color value x ∈ logRGB we have that
$$H_2(x + d^{1,2}) = H_1(x), \qquad (11)$$
where $H_1(\cdot)$ and $H_2(\cdot)$ denote the histograms of $\log(I_1)$ and $\log(I_2)$ respectively. Consequently, following Section 2.2.1, we have that the color pivots of $\log(I_1)$ and $\log(I_2)$ are also related by
$$\rho_i^{\log(I_2)} = \rho_i^{\log(I_1)} + d^{1,2}. \qquad (12)$$
From Eq. 12 and Eq. 5, since the same translation is applied to every pixel and to every pivot and therefore cancels in each pixel-to-pivot difference, we have
$$MTT^{\log(I_2)}_{j,i} = MTT^{\log(I_1)}_{j,i}. \qquad (13)$$
Therefore, our representation computed on logRGB space is approximately invariant to the illuminant. An example of this invariance is shown in Fig. 4, where we can see, from left to right, the original RGB image, the results of the MTT representation fixing the number of channels to 3, and a visualization of the MTT channels concatenated as an RGB image. It is clear that the MTT channels are very similar for all the images, making the RGB-like visualization stable under illuminant changes.
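The argument can be checked numerically with the sketch below. It assumes a purely diagonal (Von Kries) illuminant change, strictly positive pixel values so that the logarithm is defined, a distance-based encoding of the form "maximum distance to the pivot minus distance of the pixel to the pivot", and pivots that correspond to the same pixels in both images (which is what the translation of the histogram implies). All function names are hypothetical.

```python
import math


def mtt_encode(pixels, pivots):
    """Distance-based encoding: max distance to the pivot minus own distance."""
    dist = [[math.dist(p, rho) for rho in pivots] for p in pixels]
    max_dist = [max(row[i] for row in dist) for i in range(len(pivots))]
    return [[max_dist[i] - row[i] for i in range(len(pivots))] for row in dist]


def apply_illuminant(pixels, scale):
    """Scale each channel independently (diagonal illuminant change)."""
    return [tuple(v * s for v, s in zip(p, scale)) for p in pixels]


def log_rgb(pixels):
    """Move strictly positive RGB triples to logRGB space."""
    return [tuple(math.log(v) for v in p) for p in pixels]


def mtt_in_log_space(pixels, pivot_indices):
    """Encode an image in logRGB using the pixels at `pivot_indices` as pivots."""
    logged = log_rgb(pixels)
    return mtt_encode(logged, [logged[k] for k in pivot_indices])
```

Encoding an image and its rescaled version with `mtt_in_log_space` should give the same channel values up to floating-point rounding, because the per-channel log scaling factor is added to both the pixel and the pivot and drops out of their difference.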

Experiments and Results
As presented in the previous section, MTT provides a new color representation which is based on the specific content of each image. In this section we show its power to build generic color image descriptors. The evaluation is performed in two steps. First, we evaluate how MTT overcomes the problems of RGB and opponent spaces in encoding the spatio-chromatic information of images; we measure this improvement in terms of inter-channel correlation and local contrast, and we also show how the MTT representation improves the descriptive ability of a specific image descriptor. Second, we evaluate how MTT increases the performance of a descriptor in a scene classification task.
Considering the problem of generic image description, the comparison between descriptors of different images built on the MTT representation needs to cope with any number of channels. To overcome this problem we use the Semi-Joint Texton Descriptor (STD) [4] and a variant of it, both of which are explained in the next subsection. This descriptor gives an intermediate-level representation in terms of image blobs, i.e. color-homogeneous convex regions, that is computed regardless of the color space.
Taking into account the previous considerations, we organize this experimental section in four subsections. Firstly, we introduce the image descriptor used in the experiments. Secondly, we provide the details of the setup used in the experiments, which are fully explained in the remaining two subsections.

Image description: Semi-joint Texton Descriptor
The Semi-joint Texton Descriptor (STD) introduced by Alvarez and Vanrell in [4] describes an image in terms of shape and color attributes of the image blobs. STD can be computed on any color space and we show that the performance of this descriptor on scene recognition is improved when MTT is used instead of RGB or the opponent representation. An interesting property of this descriptor is that the attributes of the blobs it uses do not depend on the input color space where the blobs are initially detected. Due to this property, the descriptions of two images can be compared independently of the color representation where the blob detection is performed, even if their representations have different number of channels.
The STD algorithm starts by detecting the blobs of an image, applying a multi-scale Laplacian in each separate channel of the image representation of choice. From the blobs detected in all the channels, color and shape attributes are extracted. Then the STD is defined as a combination of shape ($STD_S$) and color ($STD_C$) descriptors of the image blobs' attributes (see sections 3.3.1.1 and 3.3.1.2):
$$STD = [STD_S, STD_C]. \qquad (14)$$

Shape descriptor
The shape descriptor is a histogram of blobs' shape attributes. For each detected blob, shape attributes, namely area, orientation, and aspect ratio, are obtained independently of the color channel where the blob was detected. Then, all blobs' attributes are quantized in a three-dimensional blob-shape space in order to compute the histogram. In this histogram each bin represents a visual word of the universal shape vocabulary defined by the quantization of the blob-shape space.

Color descriptor
The color descriptor is a histogram of blobs' color attributes. The histogram is computed in the HSI color space, where blobs' color attributes are quantized. In this histogram each bin represents a visual word of the universal color vocabulary defined by the quantization of the color space (see figure 9 in [4]).
In this paper, we also use a variant of the color descriptor defined in [30]. This approach is based on the color-naming model of Benavente et al. [31], which categorizes any image pixel p in one of the 11 basic colors defined by Berlin and Kay [32] (i.e. red, green, blue, yellow, orange, brown, pink, purple, white, gray, and black). Such categorization is done by means of an 11-dimensional membership vector $\mu(p)$, where each component $\mu_i(p)$ can be interpreted as the probability of color p to belong to a particular color $C_i$. Pixels are assigned to the color term with highest membership, which is then backed up with a modifier related to its lightness (i.e. dark, medium, or light). Using this color-naming representation the quantization of the color space is more perceptual than the original quantization [4], where just an equally-spaced division of the space was used.
To avoid confusion, from now on we denote by $STD_{OR}$ the original descriptor defined in [4] (shape descriptor plus color descriptor on HSI), and by $STD_{CN}$ the variant which uses color naming for the color description [30] (i.e. $STD_{CN}$ is formed by the shape descriptor and the color descriptor based on color names). Figure 5 shows a graphical representation of the two STD implementations used in this work.
Figure 5: Diagram of the process to obtain $STD_{OR}$ [4] and $STD_{CN}$ [30]. Blobs are detected on each channel of the chosen color representation and shape attributes are computed to generate the shape descriptor $STD_S$. The color descriptor $STD_C$ is computed either on the HSI color space or using color names.

Adding spatial layout information
The STD descriptor is a global first-order statistic of blob attributes. For scene recognition the insertion of the spatial layout is a must, since similar color areas can represent different things depending on their location in the image. For example, medium and large blue blobs can represent either water (e.g. a lake or the sea) or sky; in this sense, adding their spatial location will help to distinguish whether they represent water (usually located at the bottom of images) or sky (usually located at the top).
Hence, we add the spatial component similarly to how it is added in the GIST descriptor [8]. Given an image I we decompose it in a set of non-overlapping sub-images $I_1, \cdots, I_k$, which are obtained by dividing each of the image dimensions by a particular natural number (usually 2, 3, or 4). Then, we compute the descriptor for each of the sub-images and concatenate them, obtaining a final descriptor of the form
$$STD = [STD_{S_1}, STD_{C_1}, \cdots, STD_{S_k}, STD_{C_k}], \qquad (15)$$
where $STD_{S_i}$ and $STD_{C_i}$ represent the shape and color descriptors of sub-image $I_i$.
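The spatial decomposition can be sketched as below. The image is assumed to be a list of pixel rows, the grid is n × n (so n = 2, 3, 4 gives 4, 9, 16 sub-images), and `describe` stands for any per-region descriptor that returns a list of numbers; it is a placeholder, not the STD implementation. The row-major ordering of the sub-images is also an assumption.

```python
def split_grid(image, n):
    """Split an image (list of rows) into n x n non-overlapping tiles, row-major."""
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(n):
        for c in range(n):
            rows = image[r * h // n:(r + 1) * h // n]
            tiles.append([row[c * w // n:(c + 1) * w // n] for row in rows])
    return tiles


def spatial_descriptor(image, n, describe):
    """Concatenate the descriptor of every tile into one vector."""
    out = []
    for tile in split_grid(image, n):
        out.extend(describe(tile))
    return out
```

With a descriptor of fixed length per tile, the final vector is n² times longer than the global one, which is why the descriptor sizes scale with the number of sub-images.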

Experimental setup
In our experiments, the maximum number of channels for the MTT representation is set to L = 8. This value was experimentally found by testing values from L = 2 to L = 11. Results gradually improve as the value of L increases, but for L > 8 the improvement is not significant. If more than 8 ridges are extracted from an image (see Section 2.2.1), the 8 ridges that represent the largest areas of the image (computed via a watershed in the color histogram of the image) are selected. To obtain the shape descriptor, we use the following quantization of the shape space: 8 orientations (0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, 157.5°), 7 scales (area), and 3 aspect ratio values (isotropic, elliptical, and highly-elongated). Isotropic blobs are assigned to orientation 0°. Thus the shape descriptor has dimension 119 = (8 orientations × 7 scales × 2 aspect ratios) + 7 (one bin per scale for isotropic blobs).
In the case of the color descriptor we have used the two configurations explained in Section 3.3.1.2. For $STD_{OR}$, the HSI color space is quantized in 16 bins for H, 4 for S, and 5 for I, giving a color descriptor of 320 bins. For $STD_{CN}$, color is defined in terms of 11 names and 3 modifiers, which gives a size of 33 bins for the color descriptor. Therefore, the total size of $STD_{OR}$ is 119 + 320 = 439 bins, whereas the total size of $STD_{CN}$ is 119 + 33 = 152 bins. If spatial decomposition is used (see Section 3.3.1.3), these values should be multiplied by the number of sub-images considered to obtain the final size of the descriptor.
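The descriptor sizes quoted above follow from simple bin counting, which the short sketch below reproduces. The function and constant names are illustrative only; the numbers are the ones given in the text.

```python
def shape_bins(orientations=8, scales=7, aspect_ratios=3):
    """Number of shape bins: isotropic blobs get one bin per scale only."""
    oriented = orientations * scales * (aspect_ratios - 1)
    isotropic = scales
    return oriented + isotropic


HSI_BINS = 16 * 4 * 5   # H x S x I quantization of the original descriptor
CN_BINS = 11 * 3        # color names x lightness modifiers


def descriptor_size(color_bins, sub_images=1):
    """Total bins of shape plus color descriptor, times the number of sub-images."""
    return (shape_bins() + color_bins) * sub_images
```

For instance, `descriptor_size(HSI_BINS)` gives 439 and `descriptor_size(CN_BINS, 4)` gives 608 for a decomposition into 4 sub-images.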
Finally, the dataset used in all the experiments is the dataset of scenes created by Oliva and Torralba [8], which contains 2688 images of 256 × 256 pixels from 8 categories: coast, forest, highway, inside city, mountain, open country, street and tall building.

Experiment 1: Analysis of MTT properties
In this first experiment we analyze the properties of the proposed color representation. As we mentioned in the introduction, the main problems of usual color spaces in encoding the spatio-chromatic image structure are the high correlation between channels and the lack of local contrast for specific colors. These two properties are inherent to the channel-based representation derived from the sensor, which reduces the capability to represent all the image details. Even when we transform to an opponent representation, the lack of contrast of the new chromaticity channels does not allow representing all the details of areas with homogeneous chromaticity. Considering these two aspects, in this experiment we have computed the inter-channel correlation and the channels' local contrast for RGB, the normalized opponent space (nOPP), and the MTT representation. We have also considered the space defined by the three eigenvectors obtained by PCA on the RGB space.
For a given image, the inter-channel correlation has been computed as the average of the minimum pairwise-channel correlation. To obtain the local contrast, we use the method defined by Haun and Peli [33].
The results are shown in Table 1. We can see that the MTT representation presents a combined result of low inter-channel correlation and high local contrast. If these results are compared to the ones obtained by the opponent space, we see that MTT obtains better results in both measures. PCA presents the lowest correlation at the cost of also obtaining the lowest local contrast. Compared to the RGB space, the local contrast of RGB channels is slightly higher than in MTT, but in RGB the correlation between its channels is considerably higher than in MTT. We also looked at the behavior of local contrast when considering only the three MTT channels that have the highest local contrast for each image. In this case, the result for MTT is over 10% higher than in RGB, therefore showing that a subset of the MTT channels presents higher local contrast than any other representation of the same dimension.
Let us now show how the better results of MTT in correlation and local contrast allow for a better image description. To this end, we detect the blobs in each image of the dataset (using the blob detector included in the STD descriptor) on different color representations to analyze how well these blobs describe the content of the image. We assume that, in general, the larger the area covered by detected blobs, the better the overall appearance of the image will be described. Thus, an image can be reconstructed by plotting its blobs at the locations where they were detected, and filling them with their color attribute. Figure 6 shows a visual comparison between the blobs detected on the proposed MTT, the normalized opponent color space, and the RGB space. We can appreciate that on MTT more parts of the image are described, the details are better represented and the overall structure of the original image (i.e. the gist of the image) is more appreciable. To give a quantitative analysis of the results in the previous figure, in Table 2 we show the percentages of area covered by blobs detected on RGB, the opponent space and MTT. As can be seen, the percentage of area covered by blobs detected on MTT is higher than the ones obtained on the other color representations. This increase can be found in all the categories of the dataset. For example, in the forest category the increase is over 13% with respect to RGB. This could be due to the fact that images from this category have low contrast and similar hues, so that areas of similar color cannot be detected as different regions in the opponent or the RGB channels. By contrast, MTT is better able to represent different shades of the same hue in different channels, which facilitates the subsequent blob detection.
Finally, let us analyze how our better detection of the gist of the image translates to the shape part of the descriptor, STD_S. To this end, in Fig. 7 we compare the distributions of blobs detected in an image using the opponent and MTT representations. Distributions are displayed as 3D histograms where one axis represents orientation, another represents aspect ratio and area jointly, and the third represents the number of blobs. Each visual word in STD_S clusters blobs with a similar visual appearance (i.e., similar area, orientation, and aspect ratio). We note that STD_S on the MTT channels detects more blobs than on the opponent space, especially in those bins where some blobs are already detected on the opponent space. Moreover, MTT allows the detection of blobs with attributes corresponding to bins where only a few blobs are detected on the opponent channels. These extra blobs detected on the MTT representation are mainly found in large uniform areas, which explains why MTT is more effective at representing the overall structures of the image, as we have seen in Figure 6.

Experiment 2: Scene recognition
In this experiment we test the efficiency of the new representation when it is used to compute the STD for scene recognition tasks. We first compare different spatial decompositions to determine the best configuration of STD and then we compare the results to the state of the art on the database of Oliva and Torralba [8]. The experiments are done following the same methodology used in [34]. A linear support vector machine is trained and tested on a randomly selected split of 600 images for training and 120 images for testing. This procedure is repeated 10 times and results are averaged.
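The repeated random-split protocol can be sketched as below. The function names are hypothetical, and `nearest_centroid` is only a dependency-free placeholder classifier: it is not the linear support vector machine used in the experiments, and any function with the same signature can be passed instead.

```python
import numpy as np

def nearest_centroid(train_x, train_y, test_x):
    """Placeholder classifier (NOT the linear SVM used in the paper)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def repeated_split_accuracy(x, y, fit_predict, n_train=600, n_test=120,
                            n_trials=10, seed=0):
    """Mean and standard deviation of accuracy (%) over random splits.

    Each trial draws a fresh permutation, so the training and test sets
    are disjoint within a trial but differ between trials.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        order = rng.permutation(len(y))
        train, test = order[:n_train], order[n_train:n_train + n_test]
        pred = fit_predict(x[train], y[train], x[test])
        scores.append(100.0 * np.mean(pred == y[test]))
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting both the mean and the standard deviation over the trials matches the way accuracies are summarized in Table 3.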

Analysis of spatial decomposition
As stated in Section 3.3.1.3, the inclusion of spatial information in STD can improve its results for general tasks in computer vision. Spatial information is a building block of some image descriptors, such as GIST [8], but it should not be confused with the idea of spatial pyramids [35], where the descriptor is computed on regions of different sizes and the results are later combined into a single descriptor. To analyze the relationship between the number of sub-images used and the accuracy achieved, we have computed the results of the original implementation of STD (STD_OR) and of STD using color names (STD_CN) on different color spaces, considering both the whole image (no spatial decomposition) and different numbers of sub-images (4, 9, and 16). According to these results (see Fig. 8), including spatial information in the descriptor by dividing the image into 4 sub-images increases the accuracy by at least 4% in all cases. Considering 9 sub-images still increases the accuracy, but the increase is smaller than in the previous case. Beyond that, the increase is not significant, and there is even a slight decrease in accuracy for the descriptors computed on RGB.
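The spatial decomposition can be sketched as a regular grid whose per-cell descriptors are concatenated. The sketch assumes equal-sized, non-overlapping sub-images; `grid_descriptor` is a hypothetical name, and `toy_histogram` is a stand-in intensity histogram used only to make the example runnable, not the STD descriptor.

```python
import numpy as np

def grid_descriptor(image, describe, cells_per_side):
    """Concatenate describe(sub_image) over a regular grid of sub-images.

    cells_per_side = 1, 2, 3, 4 gives 1, 4, 9, 16 sub-images.
    """
    h, w = image.shape[:2]
    row_edges = np.linspace(0, h, cells_per_side + 1).astype(int)
    col_edges = np.linspace(0, w, cells_per_side + 1).astype(int)
    parts = []
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            parts.append(describe(image[r0:r1, c0:c1]))
    return np.concatenate(parts)

def toy_histogram(patch, bins=8):
    """Hypothetical stand-in for STD: a normalized intensity histogram."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)
```

Under this scheme the descriptor length grows linearly with the number of sub-images, which is the size versus accuracy trade-off discussed above.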

Comparison to state of the art
Given the results of the previous section, we use 4 sub-images to compute STD_OR and STD_CN because this configuration provides good performance and the size of the descriptor does not increase dramatically (1756 for STD_OR and 608 for STD_CN). These results are compared with the ones reported in [34] for three well-known descriptors, SIFT [7], GIST [8], and HMAX [36], and are presented in Table 3. Rows 1 to 3 summarize the results of Brown and Süsstrunk in [34]. We only report the color space where each descriptor achieved the best results. The highest accuracy was obtained with GIST on the opponent space (without normalization). Rows 4 to 6 and 7 to 9 show the performance of STD_OR and STD_CN, respectively. In both cases, the descriptor is computed on three color representations (RGB, normalized opponent space, and MTT).
Table 3: Accuracy (%) and standard deviation computed over 10 trials on the scene recognition experiment. Results for HMAX, GIST, and SIFT were extracted from [34]. The color space used is shown in parentheses.
Analyzing the results, the use of MTT on both STD descriptors provides an improvement in accuracy of about 4% and 4.5% compared with RGB and the normalized opponent space, respectively. This result can also be observed in Fig. 8, where, for any number of sub-images, any descriptor computed on MTT outperforms the same descriptor computed on RGB or on the normalized opponent space.
Moreover, we computed the Wilcoxon test under the null hypothesis that the results obtained with GIST on the opponent space and with STD_CN on MTT in the 10 trials of the experiment belong to the same distribution. We obtained a p-value of 0.0020, which is below the 5% significance level. Therefore, we can reject the null hypothesis and conclude that the improvement obtained by STD_CN on MTT over GIST on the opponent space is statistically significant. In summary, the use of MTT improves the results of the STD descriptors with respect to the use of RGB or the normalized opponent space. Furthermore, both STD_OR and STD_CN computed on MTT outperform the GIST results reported in [34]. Let us remark here that our best result (STD_CN on MTT channels, with an accuracy of 83.0%) is obtained with a descriptor composed of 608 bins, while the GIST descriptor has a size of 960 bins.
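The text does not state which variant of the Wilcoxon test was applied. Assuming a two-sided paired signed-rank test over the 10 trials, the smallest attainable exact p-value is 2/2^10 ≈ 0.00195, reached when all 10 paired differences have the same sign; the reported value of 0.0020 is consistent with that case. The sketch below computes the exact p-value by enumeration. The function name is hypothetical, and the sketch assumes no zero differences and no ties.

```python
from itertools import product

def signed_rank_exact_p(a, b):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples.

    Assumes no zero differences and no tied absolute differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = {i: r + 1 for r, i in enumerate(order)}
    w_plus = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
    total = n * (n + 1) // 2
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns and count those at least as extreme.
    count = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            count += 1
    return count / 2 ** n
```

With 10 paired trials, no smaller two-sided p-value is possible under this test, so a result of about 0.002 corresponds to one method winning in every trial.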
Moving to the analysis by category, Fig. 9 shows the confusion matrix of our best result (STD_CN on MTT). Each cell of the matrix shows the percentage of images of a class (row) classified as each of the classes (columns). From the matrix we can see that the category with the highest accuracy is forest. This could be expected, since this category shows low intraclass variability. By contrast, open country and coast present high confusion (e.g., 14% of coast images are classified as open country). Similarly, city and tall building are two categories with some confusion (8% of the images of each class are classified as the other class). Both cases can be explained by the fact that images in these pairs of categories show high similarities; for example, open country has many images of lakes and rivers that can be confused with images of coast, and the city category contains many images of buildings combined with other elements, such as cars and pedestrians, that can be confused with images from the tall building category.
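The row-normalized confusion matrix described above can be computed as in the following sketch, where class labels are assumed to be integers from 0 to n_classes - 1 and the function name is illustrative.

```python
import numpy as np

def confusion_percent(true_labels, predicted_labels, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the percentage of
    class-i images that were classified as class j."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes with no test images.
    return 100.0 * counts / np.where(row_totals == 0, 1, row_totals)
```

Because each row is normalized by its own class count, the diagonal gives the per-class accuracy, and the matrix is not necessarily symmetric.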

Conclusions
The main novelty of this work is the creation of a new color representation based on the specific content of the image. With this approach we aim at an image color coding that enhances spatio-chromatic information and reduces inter-channel correlation. The goal is achieved in a two-step process. First, we set the number of channels used in MTT to the number of relevant colors in the image, defined as pivots. Second, we build an individual channel representation that maximizes contrast differences using a similarity metric with respect to the color pivot related to each channel.
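The two steps can be illustrated with a deliberately simplified sketch. Here pivots are taken as the centers of well-populated bins of a coarse RGB histogram, and each channel is a Gaussian similarity to its pivot. The bin count, the population threshold `min_share`, the Gaussian similarity, and `sigma` are all assumptions chosen for illustration; they are not the pivot selection or the similarity metric of the proposed method.

```python
import numpy as np

def color_pivots(image, bins=4, min_share=0.05):
    """Pick pivots as centers of RGB histogram bins holding at least
    min_share of the pixels (image values assumed in [0, 1])."""
    pixels = image.reshape(-1, 3).astype(float)
    idx = np.minimum((pixels * bins).astype(int), bins - 1)
    keys = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    counts = np.bincount(keys, minlength=bins ** 3)
    chosen = np.flatnonzero(counts >= min_share * len(pixels))
    # Decode each flat bin index back into its (r, g, b) bin coordinates.
    centers = np.stack([chosen // (bins * bins),
                        (chosen // bins) % bins,
                        chosen % bins], axis=1)
    return (centers + 0.5) / bins

def pivot_channels(image, pivots, sigma=0.25):
    """One channel per pivot: Gaussian similarity of each pixel to it."""
    diff = image[None, :, :, :].astype(float) - pivots[:, None, None, :]
    return np.exp(-(diff ** 2).sum(axis=3) / (2 * sigma ** 2))
```

In this sketch the number of output channels equals the number of pivots found, so an image with more distinct dominant colors yields more channels.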
The proposed approach presents some clear advantages:
• Represents images according to their own color complexity, that is, with more than three dimensions if required. As each dominant color is mostly represented in one of the dimensions, our approach is better able to capture the image details.
• Increases the local contrast and reduces the correlation of the resulting channels, which plays a crucial role in several tasks such as edge and blob detection, segmentation, and recognition.
• Presents illuminant invariance properties if it is built on a log space. This can be an important benefit, especially in recognition tasks.
• Increases performance when used to build color image descriptions for scene classification. This increase is mainly due to the improvement in the blob detection step of the color descriptor.
To support these advantages we have performed two experiments. The first is a qualitative experiment that shows the performance of the MTT representation in a blob detection task. We show that the proposed approach presents low correlation and high local contrast, and that it increases the area covered by detected features across the full image plane. The second is a quantitative experiment on scene recognition. We show that the same descriptor improves its performance when applied on the MTT coding, and we compare the results to current state-of-the-art descriptors, which it outperforms.
In the future, we plan to study the impact of MTT on keypoint detection. Many descriptors use the Harris-Laplace detector to select the keypoints in the image where the descriptor is computed. The increase in image contrast in the MTT channels could allow the Harris-Laplace operator to detect more points, which could in turn improve the results of the image descriptors computed at those locations.

Notes
1 Hue maps are defined as clusters of neurons that peak when a specific color stimulus is presented. Although a lot of research is left to be done in this area, some interesting results have started to arise: there are more hue maps in higher levels than the six opponent colors [18,37], and the peaks of the cell responses are given by particular hues [38,39,40].
2 As defined in the C-SIFT descriptor [3].
3 We use this measure instead of a global correlation average due to the different number of channels in each color representation.