New insights in the reproducibility of visual and electronic tooth color assessment for dental practice



The aim of the study was to compare a 2D and 3D color system concerning a variety of statistical and graphical methods to assess validity and reliability of color measurements, and provide guidance on when to use which system and how to interpret color distance measures, including ΔE and d(0M1).


The color of teeth 14 to 24 of 35 patients undergoing regular bleaching treatment was visually assessed and electronically measured with the spectrophotometer Shade Inspector™. Tooth color was recorded before bleaching treatment, after 14 days, and again after 6 months. VITAPAN® Classical (2D) and VITA-3D-Master® (3D) served as reference systems.


Concerning repeated measurements, the 2D system was superior to the 3D system, both visually and electronically in terms of ΔE and d(0M1), for statistics of agreement and reliability. All four methods showed strong patterns in Bland-Altman plots. In the 3D system, hue was less reliable than lightness and chroma, which was more pronounced visually than electronically. The smallest detectable color difference varied among the four methods used, and was most favorable in the electronic 2D system. Comparing the methods, the agreement between the 2D and 3D systems in terms of ΔE was not good. The reliability of the visual and electronic methods was essentially the same in the 2D and 3D systems; this comparability was fair to good.

Clinical relevance

The 3D system may confuse human raters and even electronic devices. The 2D system is the simple and best choice.


Valid and reliable measurements of tooth color are of major importance in esthetic and restorative dentistry as well as in dental technical practice. Tooth color is usually described based on the Munsell color space in terms of hue, value, and chroma [1, 2]. Hue measures the basic color, value indicates the lightness of a color, and chroma measures the saturation or intensity of a color. Value is determined first, followed by chroma, yielding hue as the third dimension. One of the most important prerequisites is the assessment of tooth color either via visual comparison with prefabricated color scales or using measuring devices such as a colorimeter, spectrophotometer or digital imaging systems with corresponding software [3]. The most common method in clinical practice is still the visual method using VITAPAN® Classical shade guide, which is a 2D system. In 1998, the VITA 3D-Master® shade guide was launched on the dental market. It was developed to systematize color determination, thereby enhancing the likelihood of valid and reliable color measurements [4,5,6,7]. Concerning the systematic determination, however, an implicit prior belief about the VITA 3D Master® was not checked in developing this color guide: namely, that any two 3D shades within the same dimension at given constant shade values of the other two dimensions can be well differentiated by the human eye. In fact, dentists and dental technicians believe that the third dimension (hue) is problematic and that the distance between adjacent 3D shades is not large enough in this dimension. To quantify color differences, ΔE as the Euclidean distance between two points in the color space of the three dimensions (value, chroma, and hue) has been used in the majority of dental color studies [8,9,10,11,12,13,14,15,16,17,18,19,20], although a modification of ΔE is preferable [21]. 
However, numerous studies comparing visual and electronic methods have been published over the past decade [3, 8, 11, 18,19,20, 22,23,24,25,26,27].

Taking tooth color measurements is a complex process. In psychology and statistics, it is well known that repeated measurements [28, 29] or groups of observations such as on patients’ teeth increase reliability [30, 31]. Moreover, ΔE, the favored measure of color differences, cannot be applied to important graphical and statistical methods for the assessment of validity and reliability, including Bland-Altman plots to examine patterns of disagreement and the intraclass correlation coefficient (ICC) to estimate measurement variability [32]. These limitations can be overcome by using the distance of each shade from 0M1 of the 3D color system, denoted by d(0M1) [33]. Because d(0M1) does not distinguish shades of the same radius from 0M1, d(0M1) and ΔE are complementary rather than competing. For example, in studying bleaching effects, d(0M1) may be favorable for 0M1 but less favorable for comparing shades by gender and age groups (or to study whether the gender difference in tooth color increases with age). In general, validity depends on the purpose [34] and is to be redefined for every research question; there is no such thing as a universal gold standard [35, 36]. Likewise, choosing methods to assess reproducibility depends on the purpose [37]. Whereas reliability is often related to calibration or comparability of examiners before and during performance of large cross-sectional or multicenter studies (only one measurement per participant in the full-scale investigation), the smallest detectable difference or the smallest detectable change is sought in longitudinal studies (at least two measurements per participant; measurement error occurs twice or more) [37], when the difference between repeated measurements is the focus of interest. The smallest detectable difference or, in the present context, the smallest detectable color difference (SDCD), describes a statistical property and is different from perceptible or acceptable color difference thresholds.
The SDCD of a row of teeth can easily be recalculated from the SDCD of a single tooth [31]. The SDCD may differ from method to method and from study to study; this contradicts the idea that color difference thresholds are universally valid. In other words, the concept of a universal color difference threshold is scientifically misleading because it confuses validity and reliability. Moreover, color metrics are arbitrary, color perception is subjective, and acceptable color shade differences vary among different colors (ΔE: 1.1 among red shades and 2.1 among yellow shades) [38]. Despite these limitations of color science, it can serve as a rough guide for color difference thresholds and may be useful in daily tooth color determination in dentistry. Therefore, different aspects must be considered when comparing the conventional 2D system with the newer 3D system. The 3D system seems more reasonable because it is more systematically ordered. Ordering alone, however, may not be enough, because the human or electronic rater must have the chance to measure reliably. Whereas directly adjacent shades of the 3D system have mean ΔE values of about 3.8 for lightness (1M1–2M1–3M1) and 4.4 for chroma (2M1–2M2–2M3), the mean ΔE value is only about 1.5 for the six direct neighbors of hue (2L1.5–2R1.5; 2L2.5–2R2.5) [38].

Thus, it can be hypothesized that hue is measured less reliably than lightness or chroma. This can be examined not only for an electronic rater but also for a human rater; within-subject comparisons are justified because the examiner serves as her/his own control (hue as exposure versus lightness or chroma as reference), similar to n-of-1 trials [39].

The aim of this study was to compare the 2D and the 3D color systems concerning a variety of statistical and graphical methods to assess validity and reliability, as well as to provide guidance on when to use which system and how to interpret ΔE and d(0M1).

Materials and methods

Subjects and clinical procedure

In order to better assess clinically relevant color changes, color measurements were performed in patients receiving a regular in-office bleaching treatment (BT). The tooth-inclusion criteria for performing BT were no caries, endodontic treatment or restorations. Patients with insufficient oral hygiene, previous BT, periodontal disease, pregnancy, and allergy or hypersensitivity to the bleaching agents were excluded. The study was approved by the ethics committee of the Medical Association (Ärztekammer) of Mecklenburg-Vorpommern (Reg. Nr. III UV 15/08). All patients gave informed consent. Thirty-five patients (24 women, 11 men, average age 30 years) from the Dental Clinic at the University of Greifswald participated. The complete clinical procedure was performed under standardized conditions according to the standardized clinical protocol for in-office bleaching under the supervision of an experienced dentist (AW). The bleaching procedure was performed on teeth 15 to 25 and 35 to 45. Supra- and subgingival plaque, stains and calculus were removed, and all teeth were polished with non-fluoridated, oil-free pumice before bleaching.

The gingiva was protected by a liquid gingiva protectant (Dental Dam, Schütz Dental, Rosbach, Germany) activated by a light-curing unit (Ortholux™ LED Curing Light, 3M Unitek). Bleach’n Smile (35% H2O2; Schütz Dental, Rosbach, Germany) was applied three times for 10 minutes according to the manufacturer’s recommendation.

After bleaching, all teeth were fluoridated with Elmex® gelée (CP GABA, Germany).

Visual and electronic color assessment

The color of labial surfaces of teeth 14 to 24 was visually assessed by an experienced dental technician, who was ophthalmologically examined before this study [40], under diffuse daylight between 11 a.m. and 3 p.m. The time needed for color assessment was not restricted. Electronic measurements were performed with the spectrophotometer Shade Inspector™ (Schütz-Dental, Rosbach, Germany) by a dentist calibrated prior to this study [40]. The color systems VITAPAN® Classical (2D-VC; VITA Zahnfabrik, Bad Säckingen, Germany) and VITA 3D-Master® (3D; VITA Zahnfabrik, Bad Säckingen, Germany) served as reference systems. The VC color system has a two-dimensional structure that enables the description of hue (category A to D) and lightness including chroma (group 1 to 4) [41]. It serves as the standard shade guide for visual color assessment in dental practice. The 3D color system has a three-dimensional structure that enables the separate description of lightness (1 to 5 and 0 for bleaching), chroma (1 to 3, including half points), and hue (M, L, R) [42]. For the measurement procedure, each tooth was categorized into the gingival (S1), the body (S2), and the incisal (S3) segment. The incisal segment S3 was not included in the analysis due to its transparency. Measurements were carried out as described in the previous study [33]. Time points of visual and electronic measurements were before BT (T1/T2- Baseline), 14 days (T3/T4) and 6 months (T5/T6) after BT (Fig. 1).

Fig. 1

Consort Flow Diagram

Statistical methods

ΔE = ((ΔL*)² + (Δa*)² + (Δb*)²)^(1/2) and ΔE00 [43] were calculated. ΔE00 is superior to ΔE, but its calculation is considerably more involved. Irregularities in the color space are corrected in three steps: (1) the differences in the individual dimensions are calculated; (2) these differences are weighted; and (3) a term for the interaction between the chroma differences and the hue differences is added. The full calculation comprises 22 lines of formulae [43]. ΔE00 values are usually smaller than those of ΔE [21]. Here, we focused on ΔE because it is more commonly used. The Bland-Altman plot [44] is one of the most frequently cited methods in medicine. Although several adaptations have been discussed [45,46,47,48,49], we present only the classical plot with the mean difference and the limits of agreement for d(0M1), which is ΔE of each shade from 0M1. For method comparisons, but not for intra-rater comparisons, the regression line was added. Out of 840 paired observations, a total of 30–55 observations can be expected to be outside the limits of agreement according to M. Bland [50]. Besides the limits of agreement (mean difference between measurements ± 1.96 × standard deviation of the differences [44]), we present the agreement within 2.7 [16] and 3.7 [51] units of d(0M1) and ΔE. These agreement statistics and the difference between the pairs of observations (denoted by d2 – d1 for d(0M1)), including its standard deviation, are the only measurement error statistics also reported for ΔE. The standard error of measurement (SEM) is a further agreement statistic and is reported in two versions [37], for which the values are very similar herein. The SDCD is defined as 1.96 × √2 × SEM ≈ 2.77 × SEM [37]. The SDCD on the level of groups of observations or a patient’s teeth is calculated according to de Vet et al. 2001 [31].
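
The agreement statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Stata code used in the study: the function names and the paired values are hypothetical, and the expected number of pairs outside the limits is approximated with a normal approximation to the binomial distribution, which may not be the method behind the quoted 30–55 range.

```python
import math
from statistics import mean, stdev


def delta_e(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))


def limits_of_agreement(first, second):
    """Mean difference and 95% limits of agreement for paired values."""
    diffs = [s - f for f, s in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def expected_outside(n_pairs, share=0.05):
    """Normal-approximation range for the count of pairs outside the limits."""
    centre = n_pairs * share
    half_width = 1.96 * math.sqrt(n_pairs * share * (1 - share))
    return centre - half_width, centre + half_width


# Hypothetical d(0M1) values for four teeth, each measured twice.
first_run = [10.0, 12.0, 14.0, 16.0]
second_run = [11.0, 12.0, 16.0, 17.0]
bias, lower, upper = limits_of_agreement(first_run, second_run)

# Range of pairs expected outside the limits for 840 paired observations.
low_count, high_count = expected_outside(840)
```

For 840 pairs, this approximation yields roughly 29.6 to 54.4 observations, which is close to the range of 30–55 quoted above.
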
In addition to agreement statistics, which are related to differences of repeated measurements, we present reliability statistics, which are related to calibration or comparability of raters or methods [34]. The fraction of the total measurement variance due to variance among teeth is estimated by three versions of the intraclass correlation coefficient (ICC) [28]. Whereas the ICC(3,1) ignores systematic differences between the two methods, raters, or measurements of the same rater, the ICC(2,1) includes an additional term of the variance among raters to account for the total measurement variance (denominator) [28, 37]. Thus, the greater the systematic difference between two raters, the smaller the ICC(2,1) compared with the ICC(3,1). The ICC is the most appropriate reliability statistic [37] and is recommended alongside the Bland-Altman plot [32]. To avoid confusing terminology, SEM, SDCD and ICC are presented in the terminology used in Shrout & Fleiss [28]. ICC and kappa, which are closely related [32, 52], are interpreted according to Byrt’s classification [53]. Graphics and statistical analyses were performed using Stata software, release 14.2 (Stata Corporation, College Station, TX, USA). As the American Statistical Association took a stand against Null Hypothesis Significance Testing [54, 55], we present confidence intervals as recommended [56]. Because accuracy requires a large sample size [44], we aimed for at least 200 observations as recommended [57].
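
The difference between ICC(2,1) and ICC(3,1) can be made concrete with a small sketch. It computes both coefficients from the two-way mean squares in the form given by Shrout & Fleiss; the function name and the ratings are hypothetical and serve only to show the effect of a purely systematic shift between two raters.

```python
def icc_two_way(ratings):
    """Return (ICC(2,1), ICC(3,1)) for a list of rows, one row per tooth,
    one column per rater or method."""
    n = len(ratings)       # number of teeth (targets)
    k = len(ratings[0])    # number of raters or methods
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-teeth mean square
    jms = ss_cols / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square

    icc31 = (bms - ems) / (bms + (k - 1) * ems)
    icc21 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    return icc21, icc31


# Hypothetical data: the second rater is always exactly 2 units higher.
shifted = [[10.0, 12.0], [12.0, 14.0], [15.0, 17.0], [18.0, 20.0]]
icc21, icc31 = icc_two_way(shifted)
```

Because the only disagreement in this toy example is a constant offset, ICC(3,1) is 1 while ICC(2,1) is lower, which mirrors the statement that ICC(2,1) shrinks as the systematic difference grows.
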


Results

Intra-rater variability

The agreement within the limits of ΔE < 2.7 was better for 2D than for 3D, both visually and electronically (Table 1). Figure 2 shows how the difference between two values of d(0M1) is related to ΔE, for which the difference between visual and electronic measurements was chosen. This difference in d(0M1) was strongly and largely symmetrically related to ΔE (Fig. 2; R² = 0.69 for 2D and R² = 0.59 for 3D). The agreement within the limits of |d(0M1)| < 2.7 was also better for 2D than for 3D, both visually and electronically (Table 2). The limits of agreement were narrower for 2Delec than for the remaining three methods (Table 2; Fig. 3). The Bland-Altman plots show clear patterns of disagreement for all methods, which is most pronounced for 2Dvis (Fig. 3). The d(0M1) range is narrowest for 2Dvis (11.0) and widest for 3Delec (21.6) (Fig. 3); the variability of d(0M1) in terms of the pooled standard deviation is highest for 3Delec. The reliability in terms of the ICC is good to very good for d(0M1) (Table 2).

Table 1 Agreement of repeated measurements for four methods in terms of ΔE and ΔE00 related to a single tooth
Fig. 2

Scatter plot for the relationship between ΔE of the visual and electronic method and the difference of the distance from 0M1 between the visual and electronic method in 2D and 3D measurements; observations with the same coordinates are jittered to show their number

Table 2 Agreement and reliability of repeated measurements for four methods in terms of the distance from 0M1 related to a single tooth
Fig. 3

Bland-Altman plots for the distance from 0M1 (body surface); observations with the same coordinates are jittered to show their number

As hypothesized for the three single dimensions, hue is less reliable than lightness or chroma, both electronically (kappa value for hue = 0.45, 95% CI: 0.40–0.50; ICC(1,1) for lightness = 0.76, 95% CI: 0.74–0.79; ICC(1,1) for chroma = 0.67, 95% CI: 0.63–0.70) and visually (kappa value for hue = 0.01, 95% CI: −0.05 to 0.06; ICC(1,1) for lightness = 0.52, 95% CI: 0.47–0.57; ICC(1,1) for chroma = 0.66, 95% CI: 0.62–0.69).

The standard errors of measurement and SDCDs were essentially the same for the four methods, except for 2Delec, which was better (Table 2). On the level of groups of observations or patient’s teeth, the SDCD of 2Delec diminished from 2.8 for a single tooth to 1.4 and 1.0 for four and eight teeth, respectively. The SDCD of 2Dvis decreased from 3.9 for a single tooth to 1.9 and 1.4 for four and eight teeth, respectively.
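
The group-level values above can be checked with a short sketch. The first function follows the definition SDCD = 1.96 × √2 × SEM from the methods. The second function divides the single-tooth SDCD by the square root of the number of teeth; that scaling is an assumption here, chosen because it reproduces the reported figures to within rounding, and the function names are hypothetical.

```python
import math


def sdcd_from_sem(sem):
    """Smallest detectable color difference for a single tooth."""
    return 1.96 * math.sqrt(2) * sem


def sdcd_for_group(sdcd_single, n_teeth):
    """Assumed group-level SDCD: single-tooth SDCD divided by sqrt(n)."""
    return sdcd_single / math.sqrt(n_teeth)


# Reported single-tooth SDCDs: 2.8 (2Delec) and 3.9 (2Dvis).
elec_four = sdcd_for_group(2.8, 4)
elec_eight = sdcd_for_group(2.8, 8)
vis_four = sdcd_for_group(3.9, 4)
vis_eight = sdcd_for_group(3.9, 8)
```

With a single-tooth SDCD of 2.8, the sketch gives 1.4 for four teeth and about 0.99 for eight teeth. With 3.9 it gives 1.95 and about 1.38; the reported 1.9 for four teeth presumably stems from an unrounded single-tooth value.
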

Inter-method variability

The comparability of visual and electronic measurements was fair to good in 2D and slight to fair in 3D for the agreement within the limits of ΔE < 2.7 (Table 3). The corresponding agreement of 2D and 3D measurements was fair in the visual approach, and poor to slight in the electronic approach (Table 3).

Table 3 Comparing methods of measurements in terms of ΔE and ΔE00: 2D versus 3D within visual or electronic measurement; visual versus electronic measurements within 2D and 3D

The comparability of visual and electronic measurements was good in 2D and fair in 3D for the agreement within the limits of |d(0M1)| < 2.7 (Table 4). The corresponding agreement of 2D and 3D measurements was good in the visual approach, and fair in the electronic approach (Table 4).

Table 4 Comparing methods of measurements of the distance from 0M1 related to a single tooth: 2D versus 3D within visual or electronic measurement; visual versus electronic measurements within 2D and 3D

Concerning the comparability of the visual and electronic measurements, the difference d2 – d1, which indicates systematic error, was moderate in 2D and small in 3D (Table 4; Fig. 4). The Bland-Altman plots show marked patterns of disagreement for the approaches.

Fig. 4

Bland-Altman plots for the distance from 0M1 (body surface); observations with the same coordinates are jittered to show their number

Concerning the comparability of 2D and 3D measurements, the difference d2 – d1 indicates systematic error, which was pronounced in the electronic approach (Table 4; Fig. 4). This difference can be interpreted as constant bias. Assuming proportional bias, the regression line can be cautiously interpreted. The Bland-Altman plots, however, showed clear patterns of disagreement for the approaches; the bias between the 2D and 3D system is neither constant nor uniquely proportional.
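
The distinction between constant and proportional bias can be illustrated by regressing the paired differences on the paired means, as in a Bland-Altman plot with a regression line. This is a minimal sketch with hypothetical data and a hypothetical function name; it does not reproduce the study's analysis, and it cannot capture the irregular patterns described above.

```python
from statistics import mean


def bias_line(first, second):
    """Least-squares line of (second - first) on the pairwise mean.
    Returns (intercept, slope)."""
    means = [(f + s) / 2 for f, s in zip(first, second)]
    diffs = [s - f for f, s in zip(first, second)]
    mx, my = mean(means), mean(diffs)
    sxx = sum((x - mx) ** 2 for x in means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(means, diffs))
    slope = sxy / sxx
    return my - slope * mx, slope


# Constant bias: the second method is always 2 units higher.
const_intercept, const_slope = bias_line([10.0, 12.0, 14.0, 16.0],
                                         [12.0, 14.0, 16.0, 18.0])

# Proportional bias: the second method is always 20% higher.
prop_intercept, prop_slope = bias_line([10.0, 20.0, 30.0],
                                       [12.0, 24.0, 36.0])
```

A slope near zero with a non-zero intercept points to constant bias, whereas a clearly non-zero slope points to proportional bias. When the plot shows clusters or other patterns, as reported here, neither description is adequate.
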

The reliability in terms of the ICC was fair to good for visual and electronic measurements. The reliability in terms of the ICC(3,1), which ignores systematic differences, was good to very good for 2D and 3D measurements. The reliability in terms of the ICC(2,1), which takes into account systematic differences, was poor to very good.


Discussion

The 2D system proved superior to the 3D system both visually and electronically in terms of ΔE and d(0M1) for statistics of agreement and reliability to assess intra-rater variability. All four methods showed strong patterns of disagreement between repeated measurements in Bland-Altman plots. As hypothesized, the 3D system is less reliable for hue than for lightness and chroma, a phenomenon which was more pronounced visually than electronically. The SDCD differed among the four methods used and was most favorable in the electronic 2D system. The agreement between the 2D and 3D systems in terms of ΔE was not good. It was lower in the electronic than in the visual method. The comparability of the 2D and 3D systems was uncertain, because confidence intervals of ICCs accounting for systematic error were wide. The systematic error between the 2D and 3D systems cannot be neglected. The reliability of the visual and electronic methods was essentially the same in the 2D and 3D systems; this comparability was fair to good.

Below, the following aspects are discussed: 2D and 3D, visual and electronic, ΔE and d(0M1), Bland-Altman plots and statistics (patterns and numbers), single shade designations of the 3D system, validity and reliability, statistical SDCD and known thresholds, agreement and reliability (comparability), human and machine, and intra- and inter-method variability.

2D and 3D systems

The 2D and 3D systems differ in the color space assessed [33]. Some 3D shades that are lighter (lightness) or stronger (chroma) are not well covered by the 2D system, which is especially pronounced for the additional bleaching shades available only in the 3D system. Compared to VC, hue ranges of 3D Master are extended toward yellow-red, and 3D Master shades are more uniformly spaced than that of VC [6]. In contrast, there are spatial gaps in the 3D system which are filled in the 2D system [33, 41]. In short, both guides are suboptimal and can be improved [14].

The variability between raters may favor the 3D Master shade guide over the VC shade guide [58]. The coverage error favors the 3D system, although it is unclear whether the difference between the 2D and 3D systems is clinically relevant [12, 14, 59,60,61]. However, the clear patterns in Bland-Altman plots for d(0M1) cast doubt on the meaningfulness of converting 3D shades into VC shades (2D) as suggested elsewhere [62].

Visual and electronic methods

The gaps mentioned above that are filled by the 2D system are supported by additional 2D shades to assess quarter-points for the second shade designation number [33], which is an important difference between the visual and electronic method. A further important difference is the extension of the second shade designation number from the visual four-point scale to the electronic five-point scale. Similarly, the electronic 3D system includes bleaching shades not used by the visual 3D system evaluated here. Thus, it could have been expected that a human rater is inferior to the electronic rater, especially for the 2D system. It is of note that the agreement of intra-rater variability in terms of ΔE and d(0M1) is better for the visual 2D measurement than that for the electronic 3D measurement.

Several studies have found that instrumental methods are more accurate or reliable than visual measurements [11, 19, 23,24,25, 63,64,65]. A recent study, however, has shown that clinically relevant differences between the visual evaluation and the intraoral scanning device (3Shape) are negligible [20]. According to Li & Wang, the reliability of shade matching can be ensured neither by the instrumental nor by the visual approach [66]. Furthermore, the difference in color matching between human-eye assessment and computerized colorimetry depends on tooth type [18] and shade [8].

ΔE and d(0M1)

ΔE supports only statistics on agreement; neither Bland-Altman plots nor reliability statistics are feasible. Essentially, d(0M1) enables evaluating patterns of disagreement, other agreement statistics such as SDCD, and reliability statistics including versions of ICC accounting for systematic errors. Regarding agreement of repeated measurements of the same rater, the differences among the four methods are substantially the same for ΔE < 2.7 and d(0M1). The level of agreement within fixed limits, however, is higher for d(0M1). For example, d(0M1) hardly differentiates 3M1 from 2L2.5 (d(0M1): 15.2 and 15.3, respectively) although ΔE is 8.3. Thus, if lightness is compensated by less chroma (or chroma by darkness), then d(0M1) will not work well. The systematic errors between 2D and 3D measurements in d(0M1) are plausible, because the 2D and 3D systems differ in the color space assessed (see above). Within the 2D system, systematic errors between visual and electronic measurements are small, which can be explained by the additional quarter-point shades in the electronic 2D system.
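
Why d(0M1) and ΔE are complementary can be shown with a toy example. The CIELAB coordinates below are hypothetical and are not the measured values of any shade tab; they only demonstrate that two shades at the same distance from a reference point can still be far apart from each other.

```python
import math


def delta_e(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))


# Hypothetical coordinates, chosen only for illustration.
reference = (80.0, 0.0, 10.0)   # stands in for 0M1
shade_a = (65.0, 0.0, 10.0)     # darker, same chroma
shade_b = (71.0, 0.0, 22.0)     # less dark, more chroma

d_a = delta_e(shade_a, reference)     # distance of shade A from reference
d_b = delta_e(shade_b, reference)     # distance of shade B from reference
between = delta_e(shade_a, shade_b)   # direct distance between A and B
```

Both hypothetical shades lie 15 units from the reference, so their difference in distance is zero, yet they are more than 13 units apart. By the triangle inequality, the absolute difference of two distances from the reference can never exceed the ΔE between the two shades, so a small difference in d(0M1) does not rule out a large ΔE.
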

Bland-Altman plots and statistics – patterns and numbers

According to Bland-Altman plots, bias between the 2D and 3D systems is neither constant nor uniquely proportional. Even if these kinds of bias could be adjusted for as suggested for uniquely proportional bias [48, 49] the clear patterns are not appropriate for sophisticated statistical methods. Thus, Bland-Altman plots provide important information hardly available in numbers.

Single shade designations of the 3D system and d(0M1)

Although the reliability for the hue component of the visual 3D system is zero, the corresponding d(0M1) indicates good reliability. Likewise, the reliabilities are fair versus very good for the electronic 3D system, respectively. Thus, reliabilities of single shade designations can be misleading, especially for hue, for which ΔE values are only about 1.5 (see above). Nevertheless, the hue component of the 3D system is problematic, because its reliability is lower than that of lightness and chroma.

Validity and reliability

Colorimetry does not facilitate valid measurements. The value of d(0M1), however, supports pseudo-valid measurements, as the range of d(0M1) values differs across the four methods. The bleaching shades added to the electronic 3D system (not to the visual 3D system) make the difference: this range (21.6) is twice as high compared to visual 2D (11.0). Reliability in terms of the ICC depends on this range – if the variability of d(0M1) is small, the ICC will be small. As expected, the pooled standard deviation of the electronic 3D system is higher than that of the electronic 2D system. The ICC of the electronic 3D system, however, is lower, which emphasizes the problems with the 3D system – independent of human raters.

Smallest detectable color difference, acceptable and perceptible thresholds

An acceptability threshold of 2.7 in ΔE and a perceptibility threshold of 1.2 in ΔE are known [16]. The SDCD in terms of d(0M1) depends on the method and decreases from 2.8 to 1.0 for a row of eight teeth using electronic 2D measurements. These are statistical values and can differ from study to study. However, it is plausible that electronic 2D is the method with the best agreement, including SDCD. For properties of ΔE and d(0M1), electronic 2D is the recommended method for study designs with repeated measurements, such as longitudinal studies.

Agreement and reliability (comparability)

Whereas intra-rater agreement of repeated measurements in terms of SEM and SDCD does not differ between visual and electronic 3D measurements, the reliability in terms of the ICC differs substantially. Thus, a single human rater is not worse than the electronic device for a longitudinal study when using the 3D system. The comparability of the four methods remains uncertain. Therefore, the same method should also be used in multicenter studies.

Human and machine

Compared with a set of human raters, a set of devices from the same electronic system should have higher levels of standardization [67], which corresponds to the more favorable ICCs observed. However, n-of-1 trials, as used herein for the single human rater, limit generalizability. It may be further argued that the human rater lacks the ability to perceive hue [39]. But even if the examiner had lacked this ability, this would not have invalidated our conclusions, because we did not make an isolated statement on hue, but rather compared hue with lightness and chroma. These intra-human comparisons are supported by the n-of-1 trial design. Moreover, the same intra-device comparisons support the hypothesis that hue is not well reproducible; the electronic reliability of hue is merely fair. In addition to our findings, background knowledge further supports that 3D hue cannot be well assessed (see Introduction).

Intra- and inter-method variability – validity revisited

Whereas the reliability within each of the four methods is good to very good, comparability of the visual and electronic measurements is only fair to good. This also questions the validity of visual and electronic measurements. In turn, this question also refers to the difference between the 2D and 3D systems. In fact, Bland-Altman plots using the 2D system suggest that both visual and electronic values are valid only for d(0M1) values of about 12 (A1 – A2, B1 – B2) and greater than 20 (A4, B3 – B4, C3 – C4, D4). The shades B1 and A2 are not well covered by the 3D system [33], which is mirrored in the corresponding Bland-Altman plots. Conversely, the 3D shades 1M1 and 1M2 (both with d(0M1) < 11.2, the minimum of the 2D system) are not well covered by the 2D system [33] and question the validity of adjacent 2D shades, namely A1, B1, and B2. In daily practice, the 3D system may be useful for shades not available in the 2D system. Nevertheless, switching between methods cannot be recommended in scientific studies. The 3D system, however, can be favorable in bleaching studies owing to the added bleaching shades.


Conclusions

The 3D system may confuse both human raters and electronic devices. The 2D system is the simple and best choice.

Availability of data and materials

All data are available on request at the Department of Orthodontics, Dental School, University Medicine, Walther-Rathenau-Strasse 42, 17475 Greifswald, Germany.


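Abbreviations

2Dvis: visual assessment with the 2D system (VITAPAN® Classical)

2Delec: electronic measurement with the 2D system (VITAPAN® Classical)

3Dvis: visual assessment with the 3D system (VITA 3D-Master®)

3Delec: electronic measurement with the 3D system (VITA 3D-Master®)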



  1. Terry DA, Geller W, Tric O, Anderson MJ, Tourville M, Kobashigawa A. Anatomical form defines color: function, form, and aesthetics. Pract Proced Aesthet Dent. 2002;14:59–67 quiz 68.

    PubMed  Google Scholar 

  2. Sikri VK. Color: implications in dentistry. J Conserv Dent. 2010;13:249–55.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Chu SJ, Trushkowsky RD, Paravina RD. Dental color matching instruments and systems. Review of clinical and research aspects. J Dent. 2010;38(Suppl 2):e2–16.

    Article  PubMed  Google Scholar 

  4. Paravina RD, Powers JM, Fay RM. Dental color standards: shade tab arrangement. J Esthet Restor Dent. 2001;13:254–63.

    Article  CAS  PubMed  Google Scholar 

  5. Paravina RD, Majkic G, Imai FH, Powers JM. Optimization of tooth color and shade guide design. J Prosthodont. 2007;16:269–76.

    Article  PubMed  Google Scholar 

  6. Paravina RD, Powers JM, Fay RM. Color comparison of two shade guides. Int J Prosthodont. 2002;15:73–8.

    PubMed  Google Scholar 

  7. Paravina RD. Performance assessment of dental shade guides. J Dent. 2009;37(Suppl 1):e15–20.

    Article  PubMed  Google Scholar 

  8. Yap AU, Sim CP, Loh WL, Teo JH. Human-eye versus computerized color matching. Oper Dent. 1999;24:358–63.

    CAS  PubMed  Google Scholar 

  9. Sim CP, Yap AU, Teo J. Color perception among different dental personnel. Oper Dent. 2001;26:435–9.

    CAS  PubMed  Google Scholar 

  10. Wee AG, Monaghan P, Johnston WM. Variation in color between intended matched shade and fabricated shade of dental porcelain. J Prosthet Dent. 2002;87:657–66.

    Article  PubMed  Google Scholar 

  11. Paul S, Peter A, Pietrobon N, Hammerle CH. Visual and spectrophotometric shade analysis of human teeth. J Dent Res. 2002;81:578–82.

    Article  CAS  PubMed  Google Scholar 

  12. Li Q, Yu H, Wang YN. In vivo spectroradiometric evaluation of colour matching errors among five shade guides. J Oral Rehabil. 2009;36:65–70.

    Article  CAS  PubMed  Google Scholar 

  13. Hassel AJ, Cevirgen E, Balke Z, Rammelsberg P. Intraexaminer reliability of measurement of tooth color by spectrophotometry. Quintessence Int. 2009;40:421–6.

    PubMed  Google Scholar 

  14. Cocking C, Cevirgen E, Helling S, Oswald M, Corcodel N, Rammelsberg P, Reinelt G, Hassel AJ. Colour compatibility between teeth and dental shade guides in Quinquagenarians and septuagenarians. J Oral Rehabil. 2009;36:848–55.

    Article  CAS  PubMed  Google Scholar 

  15. Olms C, Setz JM. The repeatability of digital shade measurement--a clinical study. Clin Oral Investig. 2013;17:1161–6.

    Article  PubMed  Google Scholar 

  16. Paravina RD, Ghinea R, Herrera LJ, Bona AD, Igiel C, Linninger M, Sakai M, Takahashi H, Tashkandi E, Perez MM. Color difference thresholds in dentistry. J Esthet Restor Dent. 2015;27(Suppl 1):S1–9.

  17. Knezovic D, Zlataric D, Illes IZ, Alajbeg M, Zagar. In vivo evaluations of inter-observer reliability using VITA Easyshade® Advance 4.0 dental shade-matching device. Acta Stomatol Croat. 2016;50:34–9.

  18. Igiel C, Weyhrauch M, Wentaschek S, Scheller H, Lehmann KM. Dental color matching: a comparison between visual and instrumental methods. Dent Mater J. 2016;35:63–9.

  19. Lehmann K, Devigus A, Wentaschek S, Igiel C, Scheller H, Paravina R. Comparison of visual shade matching and electronic color measurement device. Int J Esthet Dent. 2017;12:396–404.

  20. Mehl A, Bosch G, Fischer C, Ender A. In vivo tooth-color measurement with a new 3D intraoral scanning system in comparison to conventional digital and visual color determination methods. Int J Comput Dent. 2017;20:343–61.

  21. Gomez-Polo C, Portillo Munoz M, Lorenzo Luengo MC, Vicente P, Galindo P, Martin Casado AM. Comparison of two color-difference formulas using the Bland-Altman approach based on natural tooth color space. J Prosthet Dent. 2016;115:482–8.

  22. van der Burgt TP, ten Bosch JJ, Borsboom PC, Kortsmit WJ. A comparison of new and conventional methods for quantification of tooth color. J Prosthet Dent. 1990;63:155–62.

  23. Fani G, Vichi A, Davidson CL. Spectrophotometric and visual shade measurements of human teeth using three shade guides. Am J Dent. 2007;20:142–6.

  24. Judeh A, Al-Wahadni A. A comparison between conventional visual and spectrophotometric methods for shade selection. Quintessence Int. 2009;40:e69–79.

  25. Browning WD, Chan DC, Blalock JS, Brackett MG. A comparison of human raters and an intra-oral spectrophotometer. Oper Dent. 2009;34:337–43.

  26. Gomez-Polo C, Gomez-Polo M, Celemin-Vinuela A, Martinez Vazquez De Parga JA. Differences between the human eye and the spectrophotometer in the shade matching of tooth colour. J Dent. 2014;42:742–5.

  27. Parameswaran V, Anilkumar S, Lylajam S, Rajesh C, Narayan V. Comparison of accuracies of an intraoral spectrophotometer and conventional visual method for shade matching using two shade guide systems. J Indian Prosthodont Soc. 2016;16:352–8.

  28. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8.

  29. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models. A modern perspective. 2nd ed. Boca Raton: Chapman & Hall/CRC; 2006.

  30. Altman DG, Bland JM. Statistics notes - standard deviations and standard errors. Br Med J. 2005;331:903.

  31. de Vet HCW, Bouter LM, Bezemer PD, Beurskens AJHM. Reproducibility and responsiveness of evaluative outcome measures - theoretical considerations illustrated by an empirical example. Int J Technol Assess Health Care. 2001;17:479–87.

  32. Szklo M, Nieto FJ. Epidemiology. Beyond the basics. 3rd ed. Burlington: Jones & Bartlett Learning; 2014.

  33. Ratzmann A, Schwahn C, Treichel A, Faltermeier A, Welk A. Assessing the effect of multibracket appliance treatment on tooth color by using electronic measurement. Head Face Med. 2018;14:22.

  34. Porta M, Greenland S, Hernán M, dos Santos SI, Last JM, Burón A. A dictionary of epidemiology. 6th ed. Oxford: Oxford University Press; 2014.

  35. Gigerenzer G, Marewski JN. Surrogate science: the idol of a universal method for scientific inference. J Manag. 2015;41:421–40.

  36. Greenland S. Invited commentary: the need for cognitive science in methodology. Am J Epidemiol. 2017;186:639–45.

  37. de Vet HCW, Terwee CB, Knol DL, Bouter LM. When to use agreement versus reliability measures. J Clin Epidemiol. 2006;59:1033–9.

  38. Ahn JS, Lee YK. Color distribution of a shade guide in the value, chroma, and hue scale. J Prosthet Dent. 2008;100:18–28.

  39. Senn S. Statistical issues in drug development. Chichester: Wiley; 2007.

  40. Ratzmann A, Klinke T, Schwahn C, Treichel A, Gedrange T. Reproducibility of electronic tooth colour measurements. Biomed Tech (Berl). 2008;53:259–63.

  41. Park JH, Lee YK, Lim BS. Influence of illuminants on the color distribution of shade guides. J Prosthet Dent. 2006;96:402–11.

  42. Vita Z. Dental shade guides. J Am Dent Assoc. 2002;133:366–7.

  43. Sharma G, Wu WC, Dalal EN. The CIEDE2000 color-difference formula: implementation notes, supplementary test data, and mathematical observations. Color Res Appl. 2005;30:21–30.

  44. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–10.

  45. Bland JM, Altman DG. Comparing methods of measurement: why plotting difference against standard method is misleading. Lancet. 1995;346:1085–7.

  46. Krouwer JS. Why Bland-Altman plots should use X, not (Y+X)/2 when X is a reference method. Stat Med. 2008;27:778–80.

  47. Carstensen B. Comparing methods of measurement: extending the LoA by regression. Stat Med. 2010;29:401–10.

  48. Taffé P, Peng M, Stagg V, Williamson T. MethodCompare: an R package to assess bias and precision in method comparison studies. Stat Methods Med Res. 2019;28(8):2557–65.

  49. Taffe P, Peng MK, Stagg V, Williamson T. Biasplot: a package to effective plots to assess bias and precision in method comparison studies. Stata J. 2017;17:208–21.

  50. Bland M. Should all my observations lie between the limits of agreement? 2004.

  51. Johnston WM, Kao EC. Assessment of appearance match by visual observation and clinical colorimetry. J Dent Res. 1989;68:819–22.

  52. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: Wiley; 1981.

  53. Byrt T. How good is that agreement? Epidemiology. 1996;7:561.

  54. Wasserstein RL, Assoc AS. ASA statement on statistical significance and P-values. Am Stat. 2016;70:131–3.

  55. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70:129–31.

  56. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50.

  57. How can I decide the sample size for a study of agreement between two methods of measurement? 2004. Last accessed October 5, 2020.

  58. Oh WS, Koh IW, O'Brien WJ. Estimation of visual shade matching errors with 2 shade guides. Quintessence Int. 2009;40:833–6.

  59. Hassel AJ, Koke U, Schmitter M, Beck J, Rammelsberg P. Clinical effect of different shade guide systems on the tooth shades of ceramic-veneered restorations. Int J Prosthodont. 2005;18:422–6.

  60. Bayindir F, Kuo S, Johnston WM, Wee AG. Coverage error of three conceptually different shade guide systems to vital unrestored dentition. J Prosthet Dent. 2007;98:175–85.

  61. Ongul D, Sermet B, Balkaya MC. Visual and instrumental evaluation of color match ability of 2 shade guides on a ceramic system. J Prosthet Dent. 2012;108:9–14.

  62. Zenthofer A, Wiesberg S, Hildenbrandt A, Reinelt G, Rammelsberg P, Hassel AJ. Selecting VITA classical shades with the VITA 3D-master shade guide. Int J Prosthodont. 2014;27:376–82.

  63. Paul SJ, Peter A, Rodoni L, Pietrobon N. Conventional visual vs spectrophotometric shade taking for porcelain-fused-to-metal crowns: a clinical comparison. Int J Periodontics Restorative Dent. 2004;24:222–31.

  64. Pimentel W, Tiossi R. Comparison between visual and instrumental methods for natural tooth shade matching. Sci Color. 2014:47–9.

  65. Liberato WF, Barreto IC, Costa PP, de Almeida CC, Pimentel W, Tiossi R. A comparison between visual, intraoral scanner, and spectrophotometer shade matching: a clinical study. J Prosthet Dent. 2019;121:271–5.

  66. Li Q, Wang YN. Comparison of shade matching by visual observation and an intraoral dental colorimeter. J Oral Rehabil. 2007;34:848–54.

  67. Todorovic A, Todorovic A, Gostovic AS, Lazic V, Milicic B, Djurisic S. Reliability of conventional shade guides in teeth color determination. Vojnosanit Pregl. 2013;70:929–34.


Acknowledgements

Not applicable.

Informed consent

Informed consent was obtained individually from all participants included in the study.

Explanation of any issues relating to journal policies

No issues.

Confirmation that all authors have read and approved the manuscript

All authors have read and approved the manuscript.

Confirmation that the content of the manuscript has not been published or submitted for publication elsewhere

The authors confirm that the content of the manuscript has not been published or submitted for publication elsewhere.


Funding

The study was not funded. Open Access funding was enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations



Contributions

AR contributed to the study design, recruitment of participants, analysis and interpretation, and writing of the manuscript. AW contributed to the study design, supervision of clinical treatment, analysis and interpretation, and revision of the manuscript. JF critically revised the manuscript. StH contributed to data acquisition. CS contributed to the statistical analysis and interpretation and wrote the statistical section of the manuscript. All authors gave final approval and agreed to be accountable for all aspects of the work.

Corresponding author

Correspondence to Anja Ratzmann.

Ethics declarations

Ethics approval

All procedures performed in this study involving human participants were in accordance with the ethical standards of the institutional research committee of the Ärztekammer Mecklenburg-Vorpommern (Reg. Nr. III UV 15/08) and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Consent for publication

Not applicable.

Competing interests

All authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ratzmann, A., Welk, A., Hoppe, S. et al. New insights in the reproducibility of visual and electronic tooth color assessment for dental practice. Head Face Med 16, 37 (2020).
