New insights in the reproducibility of visual and electronic tooth color assessment for dental practice

Background The aim of the study was to compare a 2D and a 3D color system using a variety of statistical and graphical methods to assess the validity and reliability of color measurements, and to provide guidance on when to use which system and how to interpret color distance measures, including ΔE and d(0M1). Methods The color of teeth 14 to 24 of 35 patients undergoing regular bleaching treatment was visually assessed and electronically measured with the spectrophotometer Shade Inspector™. Tooth color was recorded before bleaching treatment, after 14 days, and again after 6 months. VITAPAN® Classical (2D) and VITA-3D-Master® (3D) served as reference systems. Results Concerning repeated measurements, the 2D system was superior to the 3D system, both visually and electronically in terms of ΔE and d(0M1), for statistics of agreement and reliability. All four methods showed strong patterns in Bland-Altman plots. In the 3D system, hue was less reliable than lightness and chroma, which was more pronounced visually than electronically. The smallest detectable color difference varied among the four methods used, and was most favorable in the electronic 2D system. Comparing the methods, the agreement between the 2D and 3D system in terms of ΔE was not good. The reliability of the visual and electronic method was essentially the same in the 2D and 3D systems; this comparability is fair to good. Clinical relevance The 3D system may confuse human raters and even electronic devices. The 2D system is the simple and best choice.


Background
Valid and reliable measurements of tooth color are of major importance in esthetic and restorative dentistry as well as in dental technical practice. Tooth color is usually described based on the Munsell color space in terms of hue, value, and chroma [1,2]. Hue measures the basic color, value indicates the lightness of a color, and chroma measures the saturation or intensity of a color. Value is determined first, followed by chroma, yielding hue as the third dimension. One of the most important prerequisites is the assessment of tooth color either via visual comparison with prefabricated color scales or using measuring devices such as a colorimeter, spectrophotometer or digital imaging systems with corresponding software [3]. The most common method in clinical practice is still the visual method using VITAPAN® Classical shade guide, which is a 2D system. In 1998, the VITA 3D-Master® shade guide was launched on the dental market. It was developed to systematize color determination, thereby enhancing the likelihood of valid and reliable color measurements [4][5][6][7]. Concerning the systematic determination, however, an implicit prior belief about the VITA 3D Master® was not checked in developing this color guide: namely, that any two 3D shades within the same dimension at given constant shade values of the other two dimensions can be well differentiated by the human eye. In fact, dentists and dental technicians believe that the third dimension (hue) is problematic and that the distance between adjacent 3D shades is not large enough in this dimension. To quantify color differences, ΔE as the Euclidean distance between two points in the color space of the three dimensions (value, chroma, and hue) has been used in the majority of dental color studies [8][9][10][11][12][13][14][15][16][17][18][19][20], although a modification of ΔE is preferable [21]. 
Numerous studies comparing visual and electronic methods have been published over the past decade [3,8,11,18,19,20,22,23,24,25,26,27].
Taking tooth color measurements is a complex process. In psychology and statistics, it is well known that repeated measurements [28,29] or groups of observations such as on patients' teeth increase reliability [30,31]. Moreover, the favored ΔE to measure color differences cannot be applied to important graphical and statistical methods for the assessment of validity and reliability, including Bland-Altman plots to examine patterns of disagreement and the intraclass correlation coefficient (ICC) to estimate measurement variability [32]. These limitations can be overcome by using the distance of each shade from 0M1 of the 3D color system, denoted by d(0M1) [33]. Because d(0M1) does not distinguish shades of the same radius from 0M1, d(0M1) and ΔE are complementary rather than competing. For example, in studying bleaching effects, d(0M1) may be favorable for 0M1 but less favorable for comparing shades by gender and age groups (or to study whether the gender difference in tooth color increases with age). In general, validity depends on the purpose [34] and is to be redefined for every research question; there is no such thing as a universal gold standard [35,36]. Likewise, choosing methods to assess reproducibility depends on the purpose [37]. Whereas reliability is often related to calibration or comparability of examiners before and during performance of large cross-sectional or multicenter studies (only one measurement per participant in the full-scale investigation), the smallest detectable difference or the smallest detectable change is sought in longitudinal studies (at least two measurements per participant; measurement error occurs twice or more) [37], when the difference between repeated measurements is the focus of interest. The smallest detectable difference or, in the present context, the smallest detectable color difference (SDCD), describes a statistical property and is different from perceptible or acceptable color difference thresholds.
The SDCD of a row of teeth can easily be recalculated from the SDCD of a single tooth [31]. The SDCD may differ from method to method and from study to study; it contradicts the idea that color difference thresholds are universally valid. In other words, the concept of a universal color difference threshold is scientifically misleading because it confuses validity and reliability. Moreover, color metrics are arbitrary, color perception is subjective, and acceptable color shade differences vary among different colors (ΔE: 1.1 among red shades and 2.1 among yellow shades) [38]. Despite these limitations of color science, it can serve as a rough guide for color difference thresholds and may be useful in daily tooth color determination in dentistry. Therefore, different aspects must be considered when comparing the conventional 2D system with the newer 3D system. The 3D system seems more reasonable because it is more ordered. Ordering alone, however, may not be enough, because the human or electronic rater must have the chance to measure reliably. Whereas directly adjacent shades of the 3D system have mean ΔE values of about 3.8 for lightness (1M1-2M1-3M1) and 4.4 for chroma (2M1-2M2-2M3), the mean ΔE value is only about 1.5 for the six direct neighbors of hue (2L1.5-2R1.5; 2L2.5-2R2.5) [38].
Thus, it can be hypothesized that hue is measured less reliably than lightness or chroma. This can be examined not only for an electronic rater but also for a human rater; within-subject comparisons are justified because the examiner serves as her/his own control (hue as exposure versus lightness or chroma as reference), similar to n-of-1 trials [39].
The aim of this study was to compare the 2D and the 3D color systems using a variety of statistical and graphical methods to assess validity and reliability, as well as to provide guidance on when to use which system and how to interpret ΔE and d(0M1).

Subjects and clinical procedure
To better assess clinically relevant color changes, color measurements were performed in patients receiving a regular in-office bleaching treatment (BT). The tooth-inclusion criteria for performing BT were no caries, endodontic treatment or restorations. Patients with insufficient oral hygiene, previous BT, periodontal disease, pregnancy, and allergy or hypersensitivity to the bleaching agents were excluded. The study was approved by the ethics committee of the Medical Association (Ärztekammer) of Mecklenburg-Vorpommern (Reg. Nr. III UV 15/08). All patients gave informed consent. Thirty-five patients (24 women, 11 men, average age 30 years) from the Dental Clinic at the University of Greifswald participated. The complete clinical procedure was performed under standardized conditions according to the standardized clinical protocol for in-office bleaching under the supervision of an experienced dentist (AW).
The bleaching procedure was performed on teeth 15 to 25 and 35 to 45. Supra- and subgingival plaque, stains and calculus were removed, and all teeth were polished with non-fluoridated, oil-free pumice before bleaching.

Visual and electronic color assessment
The color of labial surfaces of teeth 14 to 24 was visually assessed by an experienced dental technician, who was ophthalmologically examined before this study [40], under diffuse daylight between 11 a.m. and 3 p.m. The time needed for color assessment was not restricted. Electronic measurements were performed with the spectrophotometer Shade Inspector™ (Schütz-Dental, Rosbach, Germany) by a dentist calibrated prior to this study [40]. The color systems VITAPAN® Classical (2D-VC; VITA Zahnfabrik, Bad Säckingen, Germany) and VITA 3D-Master® (3D; VITA Zahnfabrik, Bad Säckingen, Germany) served as reference systems. The VC color system has a two-dimensional structure that enables the description of hue (category A to D) and lightness including chroma (group 1 to 4) [41]. It serves as the standard shade guide for visual color assessment in dental practice. The 3D color system has a three-dimensional structure that enables the separate description of lightness (1 to 5 and 0 for bleaching), chroma (1 to 3, including half points), and hue (M, L, R) [42]. For the measurement procedure, each tooth was categorized into the gingival (S1), the body (S2), and the incisal (S3) segment. The incisal segment S3 was not included in the analysis due to its transparency. Measurements were carried out as described in the previous study [33]. Time points of visual and electronic measurements were before BT (T1/T2, baseline), 14 days (T3/T4) and 6 months (T5/T6) after BT (Fig. 1).

Statistical methods
$\Delta E = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2}$ and ΔE00 [43] were calculated. ΔE00 is superior to ΔE, but its calculation is quite sophisticated. Irregularities in the color space are corrected as follows: 1. the differences in the individual dimensions are calculated; 2. weighting is carried out; 3. finally a term for the interaction between the chroma differences and the hue differences is added; the calculation includes 22 lines of formulae [43]. ΔE00 values are usually smaller than those of ΔE [21]. Here, we focused on ΔE because it is more commonly used. The Bland-Altman plot [44] is one of the most frequently cited methods in medicine. Although several adaptations have been discussed [45][46][47][48][49], we present only the classical plot with the mean difference and the limits of agreement for d(0M1), which is ΔE of each shade from 0M1. For method comparisons, but not for intra-rater comparisons, the regression line was added. Out of 840 paired observations, a total of 30-55 observations can be expected to be outside the limits of agreement according to M. Bland [50]. Besides the limits of agreement (mean difference between measurements ± 1.96 × standard deviation of the differences [44]), we present the agreement within 2.7 [16] and 3.7 [51] units of d(0M1) and ΔE. These agreement statistics and the difference between the pairs of observations (denoted by d2 − d1 for d(0M1)), including the standard deviation, are the only measurement error statistics also reported for ΔE. The standard error of measurement (SEM) is a further agreement statistic and is reported in two versions [37], for which the values are very similar herein. The SDCD is defined as 1.96 × √2 × SEM ≈ 2.77 × SEM [37]. The SDCD on the level of groups of observations or patients' teeth is calculated according to de Vet et al. 2001 [31].
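The quantities named in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the Stata code used in the study: the function names (`delta_e`, `d_0m1`, `bland_altman`) and the input layout (tuples of L*, a*, b* values and two lists of paired d(0M1) values) are assumptions made for the example.

```python
import math
from statistics import mean, stdev

def delta_e(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))

def d_0m1(lab, lab_0m1):
    """d(0M1): Delta E of a shade from the reference shade 0M1.
    The coordinates of 0M1 must be supplied by the caller."""
    return delta_e(lab, lab_0m1)

def bland_altman(first, second, limit=2.7):
    """Classical Bland-Altman statistics for paired measurements.

    Returns the mean difference (d2 - d1), the 95% limits of agreement
    (mean difference +/- 1.96 * SD of the differences), and the share of
    pairs whose absolute difference is below `limit`.
    """
    diffs = [b - a for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    within = sum(abs(d) < limit for d in diffs) / len(diffs)
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "within": within,
    }
```

With paired d(0M1) values from two sessions, `bland_altman(session1, session2)` gives the numbers that a classical Bland-Altman plot draws as horizontal lines; passing `limit=3.7` gives the agreement within the second threshold mentioned above.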
In addition to agreement statistics, which are related to differences of repeated measurements, we present reliability statistics, which are related to calibration or comparability of raters or methods [34]. The fraction of the total measurement variance due to variance among teeth is estimated by three versions of the intraclass correlation coefficient (ICC) [28]. Whereas the ICC(3,1) ignores systematic differences between the two methods, raters, or measurements of the same rater, the ICC(2,1) includes an additional term of the variance among raters to account for the total measurement variance (denominator) [28,37]. Thus, the greater the systematic difference between two raters, the smaller the ICC(2,1) compared with the ICC(3,1). The ICC is the most appropriate reliability statistic [37] and is recommended besides the Bland-Altman plot [32]. To avoid confusing terminology, SEM, SDCD and ICC are presented in the terminology used in Shrout & Fleiss [28]. ICC and kappa, which are closely related [32,52], are interpreted according to Byrt's classification [53]. Graphics and statistical analyses were performed using Stata software, release 14.2 (Stata Corporation, College Station, TX, USA). As the American Statistical Association took a stand against Null Hypothesis Significance Testing [54,55], we present confidence intervals as recommended [56]. Because accuracy requires a large sample size [44], we looked for at least 200 observations as recommended [57].
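The difference between ICC(3,1) and ICC(2,1) can be made concrete with the two-way ANOVA mean squares of Shrout & Fleiss. The sketch below is an illustration under assumptions: the function name `icc_single` and the data layout (one row per tooth, one column per rater or measurement) are chosen for the example and are not taken from the study's analysis code.

```python
def icc_single(scores):
    """Return (ICC(2,1), ICC(3,1)) for a complete n x k table.

    scores: list of rows, one row per tooth (target),
            one column per rater, method, or repeated measurement.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way layout without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)              # between-teeth mean square
    jms = ss_cols / (k - 1)              # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))   # residual mean square

    # ICC(3,1): systematic rater differences are ignored
    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    # ICC(2,1): rater variance enters the denominator
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    return icc21, icc31
```

For example, if the second rater always reads exactly 2 units higher than the first, `icc_single([[10, 12], [14, 16], [18, 20], [22, 24]])` yields an ICC(3,1) of 1 and a smaller ICC(2,1), which mirrors the statement that a larger systematic difference lowers ICC(2,1) relative to ICC(3,1).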

Intra-rater variability
The agreement within the limits of ΔE < 2.7 was better for 2D than for 3D, both visually and electronically (Table 1). Figure 2 shows how the difference between two values of d(0M1) is related to ΔE, for which the difference between visual and electronic measurements was chosen. This difference in d(0M1) was strongly and substantially symmetrically related to ΔE (Fig. 2; R² = 0.69 for 2D and R² = 0.59 for 3D). The agreement within the limits of |d(0M1)| < 2.7 was also better for 2D than for 3D, both visually and electronically (Table 2). The limits of agreement were narrower for 2D elec than for the remaining three methods (Table 2; Fig. 3). The Bland-Altman plots show clear patterns of disagreement for all methods, which is most pronounced for 2D vis (Fig. 3). The d(0M1) range is narrowest for 2D vis (11.0) and widest for 3D elec (21.6) (Fig. 3); the variability of d(0M1) in terms of the pooled standard deviation is highest for 3D elec. The reliability in terms of the ICC is good to very good for d(0M1) (Table 2).
The standard errors of measurement and SDCDs were essentially the same for the four methods, except for 2D elec, which was better (Table 2). On the level of groups of observations or patients' teeth, the SDCD of 2D elec diminished from 2.8 for a single tooth to 1.4 and 1.0 for four and eight teeth, respectively. The SDCD of 2D vis decreased from 3.9 for a single tooth to 1.9 and 1.4 for four and eight teeth, respectively.
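The group-level values reported here are consistent with dividing the single-tooth SDCD by the square root of the number of teeth. The sketch below assumes that scaling; the text only cites de Vet et al. [31] for the recalculation, so the exact formula and the function names are assumptions for illustration.

```python
import math

def sdcd_from_sem(sem):
    """SDCD = 1.96 * sqrt(2) * SEM, roughly 2.77 * SEM."""
    return 1.96 * math.sqrt(2) * sem

def sdcd_for_group(sdcd_single, n_teeth):
    """Assumed scaling: the SDCD of a mean over n teeth shrinks by sqrt(n)."""
    return sdcd_single / math.sqrt(n_teeth)

# Single-tooth SDCDs as reported in the text
for label, single in (("2D elec", 2.8), ("2D vis", 3.9)):
    for n in (4, 8):
        print(label, n, "teeth:", sdcd_for_group(single, n))
```

Under this assumption the single-tooth value of 2.8 becomes 1.4 for four teeth and about 0.99 for eight teeth, and 3.9 becomes 1.95 and about 1.38, which matches the reported figures up to rounding of the underlying unrounded values.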

Inter-method variability
The comparability of visual and electronic measurements was fair to good in 2D and slight to fair in 3D for the agreement within the limits of ΔE < 2.7 (Table 3). The corresponding agreement of 2D and 3D measurements was fair in the visual approach, and poor to slight in the electronic approach (Table 3).
The comparability of visual and electronic measurements was good in 2D and fair in 3D for the agreement within the limits of |d(0M1)| < 2.7 ( Table 4). The  (Table 4).
Concerning the comparability of the visual and electronic measurements, the difference d2 − d1, which indicates systematic error, was moderate in 2D and small in 3D (Table 4; Fig. 4). The Bland-Altman plots show marked patterns of disagreement for both approaches.
Concerning the comparability of 2D and 3D measurements, the difference d2 − d1 indicates systematic error, which was pronounced in the electronic approach (Table 4; Fig. 4). This difference can be interpreted as constant bias. Assuming proportional bias, the regression line can be cautiously interpreted. The Bland-Altman plots, however, showed clear patterns of disagreement for both approaches; the bias between the 2D and 3D system is neither constant nor uniquely proportional.
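One common way to distinguish constant from proportional bias is to regress the pairwise differences on the pairwise means, as in the regression line of a Bland-Altman plot. The sketch below assumes this form of regression; whether the regression line in the figures was fitted in exactly this way is not stated in the text, and the function name `bias_regression` is hypothetical.

```python
def bias_regression(first, second):
    """Least-squares line of differences (d2 - d1) on means ((d1 + d2) / 2).

    A slope near zero with a nonzero intercept suggests constant bias;
    a nonzero slope suggests bias that changes with the measured level.
    """
    means = [(a + b) / 2 for a, b in zip(first, second)]
    diffs = [b - a for a, b in zip(first, second)]
    m_bar = sum(means) / len(means)
    d_bar = sum(diffs) / len(diffs)
    sxx = sum((m - m_bar) ** 2 for m in means)
    sxy = sum((m - m_bar) * (d - d_bar) for m, d in zip(means, diffs))
    slope = sxy / sxx
    intercept = d_bar - slope * m_bar
    return slope, intercept
```

A constant offset such as `bias_regression([10, 20, 30], [12, 22, 32])` gives a slope of zero and an intercept equal to the offset, whereas a multiplicative difference gives a nonzero slope. Clear non-linear patterns, as described above, are captured by neither case, which is why the line should be read cautiously.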
The reliability in terms of the ICC was fair to good for visual and electronic measurements. The reliability in terms of the ICC (3,1) , which ignores systematic differences, was good to very good for 2D and 3D measurements. The reliability in terms of the ICC (2,1) , which takes into account systematic differences, was poor to very good.

Discussion
The 2D system proved superior to the 3D system both visually and electronically in terms of ΔE and d(0M1) for statistics of agreement and reliability to assess intra-rater variability. All four methods showed strong patterns of disagreement between repeated measurements in Bland-Altman plots. As hypothesized, the 3D system is less reliable for hue than for lightness and chroma, a phenomenon which was more pronounced visually than electronically. The SDCD differs by the four methods used and was most favorable in the electronic 2D system. The agreement between the 2D and 3D systems in terms of ΔE was not good. It was lower in the electronic than in the visual method. The comparability of the 2D and 3D systems was uncertain, because confidence intervals of ICCs accounting for systematic error were wide. The systematic error between the 2D and 3D systems cannot be neglected. The reliability of the visual and electronic method was substantially the same in the 2D and 3D systems; this comparability was fair to good. Below, the following aspects are discussed: 2D and 3D, visual and electronic, ΔE and d(0M1), Bland-Altman plots and statistics (patterns and numbers), single shade designations of the 3D system, validity and reliability, statistical SDCD and known thresholds, agreement and reliability (comparability), human and machine, and intra- and inter-method variability.
Fig. 2 Scatter plot for the relationship between ΔE of the visual and electronic method and the difference of the distance from 0M1 between the visual and electronic method in 2D and 3D measurements; observations with the same coordinates are jittered to show their number

2D and 3D systems
The 2D and 3D systems differ in the color space assessed [33]. Some 3D shades that are lighter (lightness) or stronger (chroma) are not well covered by the 2D system, which is especially pronounced for the additional bleaching shades available only in the 3D system. Compared to VC, hue ranges of 3D Master are extended toward yellow-red, and 3D Master shades are more uniformly spaced than that of VC [6]. In contrast, there are spatial gaps in the 3D system which are filled in the 2D system [33,41]. In short, both guides are suboptimal and can be improved [14].
The variability between raters may favor the 3D Master shade guide over the VC shade guide [58]. The coverage error favors the 3D system, although it is unclear whether the difference between the 2D and 3D systems is clinically relevant [12,14,[59][60][61]. However, the clear patterns in Bland-Altman plots for d(0M1) cast doubt on the meaningfulness of converting 3D shades into VC shades (2D) as suggested elsewhere [62].

Visual and electronic methods
The gaps mentioned above that are filled by the 2D system are supported by additional 2D shades to assess quarter-points for the second shade designation number [33], which is an important difference between the visual and electronic method. A further important difference is the extension of the second shade designation number from the visual four-point scale to the electronic five-point scale. Similarly, the electronic 3D system includes bleaching shades not used by the visual 3D system evaluated here. Thus, it could have been expected that a human rater is inferior to the electronic rater, especially for the 2D system. It is of note that the intra-rater agreement in terms of ΔE and d(0M1) is better for the visual 2D measurement than for the electronic 3D measurement. Several studies have found that instrumental methods are more accurate or reliable than visual measurements [11,19,[23][24][25][63][64][65]. A recent study, however, has shown that clinically relevant differences between the visual evaluation and the intraoral scanning device (3Shape) are negligible [20]. According to Li & Wang, the reliability of shade matching can be ensured neither by the instrumental nor by the visual approach [66]. Furthermore, the difference in color matching between human-eye assessment and computerized colorimetry depends on tooth type [18] and shade [8].

ΔE and d(0M1)
ΔE supports only statistics on agreement; neither Bland-Altman plots nor reliability statistics are feasible. Essentially, d(0M1) enables evaluating patterns of disagreement, other agreement statistics such as SDCD, and reliability statistics including versions of ICC accounting for systematic errors. Regarding agreement of repeated measurements of the same rater, the differences among the four methods are substantially the same for ΔE < 2.7 and d(0M1). The level of agreement within fixed limits, however, is higher for d(0M1). For example, d(0M1) hardly differentiates 3M1 from 2L2.5 (d(0M1): 15.2 and 15.3, respectively) although ΔE is 8.3. Thus, if lightness is compensated by less chroma (or chroma by darkness), then d(0M1) will not work well. The systematic errors between 2D and 3D measurements in d(0M1) are plausible, because the 2D and 3D systems differ in the color space assessed (see above). Within the 2D system, systematic errors between visual and electronic measurements are small, which can be explained by the additional quarter-point shades in the electronic 2D system.
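The trade-off between lightness and chroma can be illustrated numerically: two shades can lie at almost the same distance from a reference point while being far apart from each other. The coordinates below are hypothetical L*, a*, b* values invented for this example; they are not the measured coordinates of 0M1, 3M1, or 2L2.5.

```python
import math

def delta_e(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))

# Hypothetical coordinates, chosen only to mimic the pattern in the text
reference = (90.0, 0.0, 0.0)     # stands in for the reference shade
darker_pale = (76.0, 1.0, 5.0)   # darker, but low chroma
lighter_strong = (81.0, 2.0, 12.0)  # lighter, but higher chroma

d_a = delta_e(darker_pale, reference)        # distance from reference
d_b = delta_e(lighter_strong, reference)     # distance from reference
between = delta_e(darker_pale, lighter_strong)  # distance between the shades

print(round(d_a, 1), round(d_b, 1), round(between, 1))
```

Both hypothetical shades are roughly 15 units from the reference, so a distance-from-reference measure barely separates them, although they are more than 8 units apart from each other. This is the situation in which a radial measure such as d(0M1) loses information that ΔE retains.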

Bland-Altman plots and statistics: patterns and numbers
According to Bland-Altman plots, bias between the 2D and 3D systems is neither constant nor uniquely proportional. Even if these kinds of bias could be adjusted for, as suggested for uniquely proportional bias [48,49], the clear patterns are not appropriate for sophisticated statistical methods. Thus, Bland-Altman plots provide important information hardly available in numbers.

Single shade designations of the 3D system and d(0M1)
Although the reliability for the hue component of the visual 3D system is zero, the corresponding d(0M1) indicates good reliability. Likewise, the reliabilities are fair versus very good for the electronic 3D system, respectively. Thus, reliabilities of single shade designations can be misleading, especially for hue, for which ΔE values are only about 1.5 (see above). Nevertheless, the hue component of the 3D system is problematic, because its reliability is lower than those of lightness and chroma.

Validity and reliability
Colorimetry does not facilitate valid measurements. The value of d(0M1), however, supports pseudo-valid measurements, as the range of d(0M1) values differs across the four methods. The bleaching shades added to the electronic 3D system (not to the visual 3D system) make the difference: this range (21.6) is about twice that of visual 2D (11.0). Reliability in terms of the ICC depends on this range: if the variability of d(0M1) is small, the ICC will be small. As expected, the pooled standard deviation of the electronic 3D system is higher than that of the electronic 2D system. The ICC of the electronic 3D system, however, is lower, which emphasizes the problems with the 3D system, independent of human raters.
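The dependence of the ICC on the spread of d(0M1) follows from its definition as a variance ratio. The sketch below uses a simplified one-way decomposition (between-tooth variance over between-tooth plus error variance) as an assumption; the standard deviations are hypothetical values chosen for illustration, not results of the study.

```python
def icc_from_sds(sd_between, sem):
    """Simplified ICC: between-tooth variance / (between-tooth + error variance)."""
    var_between = sd_between ** 2
    var_error = sem ** 2
    return var_between / (var_between + var_error)

# Same measurement error, different spread of true values (hypothetical)
narrow = icc_from_sds(sd_between=2.0, sem=1.4)
wide = icc_from_sds(sd_between=5.0, sem=1.4)

# Wider spread, but with a disproportionately larger measurement error
wide_noisy = icc_from_sds(sd_between=5.0, sem=4.0)

print(round(narrow, 2), round(wide, 2), round(wide_noisy, 2))
```

With equal measurement error, the wider spread gives the higher ICC. A method can therefore show a larger spread and still a lower ICC only if its measurement error grows disproportionately, which is the pattern described above for the electronic 3D system.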

Smallest detectable color difference, acceptable and perceptible thresholds
An acceptability threshold of 2.7 in ΔE and a perceptibility threshold of 1.2 in ΔE are known [16]. The SDCD in terms of d(0M1) depends on the method and decreases from 2.8 to 1.0 for a row of eight teeth using electronic 2D measurements. These are statistical values and can differ from study to study. However, it is plausible that electronic 2D is the method with the best agreement, including SDCD. For properties of ΔE and d(0M1), electronic 2D is the recommended method for study designs with repeated measurements, such as longitudinal studies.

Agreement and reliability (comparability)
Whereas intra-rater agreement of repeated measurements in terms of SEM and SDCD does not differ between visual and electronic 3D measurements, the reliability in terms of the ICC differs substantially. Thus, a single human rater is not worse than the electronic device for a longitudinal study when using the 3D system. The comparability of the four methods remains uncertain. Therefore, the same method should also be used in multicenter studies.

Human and machine
Compared with a set of human raters, a set of devices from the same electronic system should have higher levels of standardization [67], which corresponds to the more favorable ICCs observed. However, n-of-1 trials, as used herein for the single human rater, limit generalizability. It may be further argued that the human rater lacks the ability to perceive hue [39]. But even if the examiner had lacked this ability, this would not have invalidated our conclusions, because we did not make an isolated statement on hue, but rather compared hue with lightness and chroma. These intra-human comparisons are supported by the n-of-1 trial design. Moreover, the same intra-device comparisons support the hypothesis that hue is not well reproducible; the electronic reliability of hue is merely fair. In addition to our findings, background knowledge further supports that 3D hue cannot be well assessed (see Introduction).

Intra- and inter-method variability: validity revisited
Whereas the reliability within each of the four methods is good to very good, comparability of the visual and electronic measurements is only fair to good. This also questions the validity of visual and electronic measurements.
In turn, this question also refers to the difference between the 2D and 3D system. In fact, Bland-Altman plots using the 2D system suggest that both visual and electronic values are valid only for d(0M1) values of about 12 (A1-A2, B1-B2) and greater than 20 (A4, B3-B4, C3-C4, D4). The shades B1 and A2 are not well covered by the 3D system [33], which is mirrored in the corresponding Bland-Altman plots. Vice versa, 3D shades 1M1 and 1M2 (both d(0M1) < 11.2 for the minimum of the 2D system) are not well covered by the 2D system [33] and question the validity of adjacent 2D shades, namely A1, B1, and B2. In daily practice, the 3D system may be useful for shades not available in the 2D system. Nevertheless, switching between methods cannot be recommended in scientific studies. The 3D system, however, can be favorable in bleaching studies owing to the added bleaching shades.

Conclusion
The 3D system may confuse both human raters and electronic devices. The 2D system is the simple and best choice.