Inter-Rater Reliability (IRR): A Valuable Metric for Qualitative Research
Most people who do qualitative research, which analyzes non-numerical information such as interviews, open-ended questionnaires, and observations, know that it involves a lot of coding. Coding is a standardized process of classifying qualitative data by applying one unified model, or “coding scheme,” across numerous sets of data. Coding aims to reduce subjective interpretation of qualitative data and instead ensure a more objective analytical process with limited bias in the results.
Coding qualitative data into both narrow and broad themes via the coding scheme is the best way to classify non-numerical participant responses. Next, it is imperative to assess the inter-rater reliability (IRR) of the completed codes by calculating IRR’s associated statistics (percent agreement and kappa score). These statistics are the main way to determine whether the coding scheme is successfully classifying the data, and also whether the coders are applying the scheme to the data consistently.
Reason 1: Inter-rater reliability (IRR) tells you how similarly or differently your data is being coded.
Say you have three researchers coding 20 different interviews for a qualitative project, and as the project manager you want to know whether all three researchers are consistently arriving at the same results before you report them. This is where inter-rater reliability (IRR) comes in. IRR is the set of preferred statistics (usually percent agreement and kappa score) that indicates how similarly or differently your data is being coded. IRR is important to know because it gives you an objective measure of the consistency of your coding results. If your three researchers’ coding results were significantly inconsistent, the IRR statistics would illustrate this, signaling either that the researchers are not following the coding scheme correctly, or that they are applying the scheme inconsistently and allowing their subjective biases or ideas to affect their coding process. Subjective coding can obviously skew research results, especially when there are multiple coders, since each new coder brings additional, often differing biases to the team, leading to even less uniform results. Thus, IRR is a valuable tool for testing how objectively and consistently the coding is actually being done. This test helps researchers or project managers identify coding problems and fix them before inaccurate results are reported.
Reason 2: IRR can tell you if your results are valid.
Depending on the project manager’s discretion, IRR may be used to indicate valid (correct) results or accurate (exact) results in a qualitative project. If IRR is being used to ensure the validity of your project’s results, then the associated statistic of percent agreement should be used. Percent agreement is calculated by dividing the number of scores on which the coders agree by the total number of scores. When comparing coder similarity, a percent agreement of around 90% or above is considered high. Since percent agreement is a statistic of IRR, once a score of 90% or higher is achieved, you can be comfortable reporting your qualitative project’s coding results, if your goal is result validity.
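As a minimal sketch, the percent agreement calculation described above can be illustrated in Python. The two coders’ labels and theme names below are hypothetical, purely for demonstration:

```python
# Hypothetical example: two coders' theme labels for 10 interview excerpts.
# Percent agreement = (items coded identically) / (total items) * 100.
coder_a = ["theme1", "theme2", "theme1", "theme3", "theme2",
           "theme1", "theme1", "theme3", "theme2", "theme1"]
coder_b = ["theme1", "theme2", "theme1", "theme3", "theme1",
           "theme1", "theme1", "theme3", "theme2", "theme1"]

# Count the items on which both coders assigned the same label.
agreements = sum(a == b for a, b in zip(coder_a, coder_b))
percent_agreement = agreements / len(coder_a) * 100
print(f"Percent agreement: {percent_agreement:.0f}%")  # 9 of 10 items match -> 90%
```

Here the coders disagree on only one of the ten excerpts, giving a percent agreement of 90%, which would clear the threshold described above.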
Reason 3: IRR can tell you if your results are accurate.
If IRR is being used to ensure the accuracy, or exactness, of your project’s results, then researchers will have to focus on the kappa score rather than the percent agreement score. The kappa score is a measure of agreement between two or more individuals that accounts for the possibility of agreement happening by chance. The kappa score can range from −1 to +1, where 0 represents the amount of agreement that can be expected from random chance, and 1 represents perfect agreement between the raters. If your IRR kappa score falls between 0.4 and 0.8 (commonly interpreted as moderate to substantial agreement) or higher, then you can be satisfied that you have high coder similarity, and report your results with high accuracy.
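The chance correction above can be sketched with Cohen’s kappa for two raters: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s category proportions. The coder labels below are hypothetical:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters coded identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over every category either rater used.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n)
              for c in counts_a.keys() | counts_b.keys())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two coders for 10 interview excerpts.
coder_a = ["theme1", "theme2", "theme1", "theme3", "theme2",
           "theme1", "theme1", "theme3", "theme2", "theme1"]
coder_b = ["theme1", "theme2", "theme1", "theme3", "theme1",
           "theme1", "theme1", "theme3", "theme2", "theme1"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Kappa: {kappa:.2f}")  # 90% raw agreement, chance-corrected to 0.83
```

Note how the kappa score (0.83) is lower than the raw 90% agreement for the same data, because part of that agreement could have occurred by chance.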
The downside to obtaining a high kappa score, or accurate results, is that it is more difficult than obtaining a high percent agreement, or valid results. Because obtaining a high kappa score is more rigorous, it requires more rounds of coding with a larger number of coders, recalculating and analyzing the kappa score in each round to ensure a high kappa score is not only obtained but also replicable across coders. This also means acquiring a high kappa score is a larger time commitment than a high percent agreement, but if accurate coding results are your goal, then the process is worthwhile.
Reason 4: IRR illustrates the strength of your coding scheme.
If IRR between three researchers coding the same 20 interviews is low (a low percent agreement and/or a low kappa score), it signals that the coding scheme (or unified coding instructions) being used likely has some gaps. Low IRR may mean that even though researchers are using the same coding protocol most of the time, missing categories in the coding scheme itself are causing large coding inconsistencies on a few items. For this reason, IRR is a good tool for gaining insight into the strength of a qualitative coding scheme. Paying attention to IRR also gives you the opportunity to revise the coding scheme if it is not proving clear enough for all researchers to code with consistently.
The purpose of IRR and the process of deriving its related statistics help qualitative researchers reduce subjectivity and improve objectivity, just as quantitative researchers strive to do. Ensuring highly valid and accurate results is at the forefront of qualitative studies, and IRR is just one tool in service of that goal.