An Inspera primer on Item Response Theory, the Engine of e-Assessment

Most e-Assessment companies rely on Item Response Theory (IRT) to give insights into the performance of Learners. IRT has some great advantages: we can use this body of latent trait models to learn more about our questions and Learner performance, and to create more advanced assessment forms.

Item Response Theory is the workhorse of modern-day psychometrics. e-Assessment platforms rely on this body of “latent trait models” to give Educators better insights into the performance of their Learners, to help them write better tests and questions, and to implement more advanced test forms.

What is Item Response Theory?

But what is Item Response Theory? Put in rather dry terms, IRT is a body of latent trait models used to measure a specific “concept” from responses to “items”. When we apply these models to Learner responses to questions, we obtain an overall ability score for each Learner.1 In other words: we take directly observable data and use it to construct a variable that is not directly observable, in this case ability.

IRT is not limited to psychometrics, either. Latent trait models may be used for many different kinds of response-pattern data. In psychology, for example, they are used to generate composite measures of “happiness”, or of other psychological characteristics that are not directly observable. In political science, IRT (both Bayesian and frequentist) has traditionally been used to place legislators in a single ideological space based on their voting records on bills.

When we use IRT for psychometric applications, we model the probability of giving the correct answer as a function of an examinee’s ability level and item characteristics. This is illustrated in the figure below, which shows what we call an “item characteristic curve” for a model where discrimination is constant and only difficulty varies between items (known as the 1-parameter logistic model or the “Rasch” model). As we go along the ability scale on the x-axis (which is constrained to the range of -3 to 3, but can be mapped onto different grades), the corresponding value on the y-axis tells us the probability of a correct answer for this particular question. This means that we are able to tell how difficult an item is for an individual with a particular skill level.
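
For readers who like to see the curve at work, here is a minimal sketch in Python (using NumPy) of the Rasch model’s item characteristic curve: the probability of a correct answer is 1 / (1 + exp(-(ability - difficulty))). The ability values and the item difficulty of 0 below are made up for illustration.

import numpy as np

def rasch_probability(ability, difficulty):
    # Probability of a correct response under the 1-PL (Rasch) model
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# An item of average difficulty (b = 0) for three hypothetical Learners
abilities = np.array([-2.0, 0.0, 2.0])
print(rasch_probability(abilities, difficulty=0.0))
# -> approximately [0.12, 0.50, 0.88]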

Although it was formulated as early as the 1950s, IRT only became accepted as a superior way of measuring student performance to “classical test theory” (CTT)2 in the 1980s. Helped by the growing availability of computing power, psychometricians were then free to explore some of the advantages of IRT.

How can IRT help in my educational assessments?

What are the advantages of IRT that Educators and Learners can benefit from? First, with IRT we obtain estimates of student performance that are comparable between cohorts. Because the parameter estimates are independent of the sample from which they are estimated, we can easily rescale the results from one cohort to the original ability scale on which the items were calibrated. Provided we include overlapping items (“anchor items”), we even obtain comparable scores from different tests. We are therefore better able to compare Learner performance between different groups, or between different cohorts of Learners taking a test year-on-year.
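
To make the rescaling idea concrete, the sketch below shows one common linking method, mean-sigma linking, applied to a handful of anchor items. The item difficulties and the two calibrations are invented for illustration; real equating designs involve more items and more careful checks.

import numpy as np

# Difficulty estimates for the same anchor items from two separate calibrations
# (numbers invented for illustration)
b_reference = np.array([-1.2, -0.3, 0.4, 1.1])   # original calibration
b_new = np.array([-0.9, 0.0, 0.7, 1.4])          # this year's calibration

# Mean-sigma linking: a linear transformation that maps the new scale
# onto the reference scale
A = b_reference.std() / b_new.std()
B = b_reference.mean() - A * b_new.mean()

def to_reference_scale(theta_new):
    # Rescale an ability estimate from the new calibration to the reference scale
    return A * theta_new + B

print(to_reference_scale(0.5))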

With IRT, we can learn more about the performance of our questions

Second, IRT can be used to learn more about the performance of our questions. While experienced item writers rely on their expertise to write good questions, it cannot hurt to test our assumptions about how an item “behaves” in a test. An IRT model estimates, for example, how difficult a question is for Learners at different levels of ability, as well as its discrimination.3 Using this information, we can calculate an item’s information function, which tells us how much information the item provides at different levels of ability.4 These characteristics can be used to make decisions about which items to change, re-use, or retire.
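
As a rough sketch of what that looks like in practice, the snippet below computes the information an item provides at different ability levels under the 2-PL model, where information is the squared discrimination times P times (1 - P). The item parameters are invented for illustration.

import numpy as np

def p_2pl(theta, a, b):
    # Probability of a correct response under the 2-PL model
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of a 2-PL item: a^2 * P * (1 - P)
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

# A fairly discriminating item (a = 1.8) with difficulty b = 0.5
thetas = np.linspace(-3, 3, 7)
print(item_information(thetas, a=1.8, b=0.5))
# Information peaks where ability is close to the item's difficulty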

Third, IRT opens up the prospect of more advanced assessment forms. Using item parameter information from IRT models, we can use test construction algorithms to generate appropriate test forms, either as several versions in a linear setup (“linear on-the-fly”), or in a computerised adaptive test (CAT). This is possible because, in contrast to CTT, in IRT difficulty is a function, not a number: items may be difficult for low-ability Learners, but easy for high-ability Learners. Once we have estimated a student’s ability, we can therefore select the questions that are most informative at that level.

In IRT, difficulty is a function, not a number

Linear on-the-fly tests help safeguard the integrity of exam material, as educators can rely on many different test forms to test on the same subject. CAT, in turn, is better adapted to the ability of individual students, optimising both the information we obtain and learner experience.
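
To illustrate the adaptive part, here is a minimal sketch of a single CAT item-selection step under the 2-PL model: given the current ability estimate, pick the unused item from the bank that provides the most information. The item bank, the ability estimate, and the selection rule shown here are simplified for illustration; a real CAT also needs an ability-estimation step, stopping rules, and exposure control.

import numpy as np

# A small calibrated item bank (parameters invented): each row is (a, b)
item_bank = np.array([
    [1.0, -1.5],
    [1.4,  0.0],
    [0.8,  0.7],
    [1.7,  1.2],
])

def next_item(theta_hat, bank, administered):
    # Pick the unused item with maximum information at the current ability estimate
    a, b = bank[:, 0], bank[:, 1]
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    info = a**2 * p * (1.0 - p)
    info[list(administered)] = -np.inf   # never re-administer an item
    return int(np.argmax(info))

# After a few responses, suppose the running ability estimate is 0.3
print(next_item(theta_hat=0.3, bank=item_bank, administered={0}))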

In sum, IRT gives us the tools to conduct better Learner and item analyses, to improve our assessments, and to create more advanced examination formats. It is no wonder, therefore, that it has truly become the workhorse of modern psychometrics, and the go-to body of statistical models for e-Assessment solutions.

Further readings

Lord, F. M. 1980. Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M., and M. R. Novick. 1968. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Notes

1. Importantly, we assume here that we are measuring a uni-dimensional concept, e.g. algebra, reading, or geometry. In practice, we have to test the assumption that the variation in student responses relates to a single latent variable (often using factor analysis).

2. CTT estimates the performance of students in a way that is applicable to the context of one test only. A student’s ability is the total score on the test, the estimated difficulty (or rather: facility) is the proportion correct, and the discrimination is measured by the item-total correlation. In IRT, by contrast, scores are comparable between tests. Put simply, test takers will have higher true scores on easier tests, and lower true scores on more challenging tests, but their ability scores will remain constant.
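
For the curious, these CTT statistics are easy to compute by hand; the sketch below does so for a tiny, made-up set of scored responses (1 = correct, 0 = incorrect), using the simple, uncorrected item-total correlation for discrimination.

import numpy as np

# Scored responses: rows are learners, columns are items (data invented for illustration)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
])

total_scores = responses.sum(axis=1)     # CTT "ability": total score per learner
facility = responses.mean(axis=0)        # CTT difficulty (facility): proportion correct per item
discrimination = np.array([              # CTT discrimination: item-total correlation
    np.corrcoef(responses[:, j], total_scores)[0, 1]
    for j in range(responses.shape[1])
])

print(total_scores, facility, discrimination)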

3. Different parameterisations of the IRT model exist. The 1-PL model, for example, assumes that discrimination (“a”) is constant across items and only estimates difficulty, whereas the 2-PL allows both parameters to be estimated. The 3-PL, in turn, estimates an additional pseudo-guessing parameter (“c”).
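
As a quick reference, the 3-PL response function below contains the other two models as special cases; the parameter values in the example are invented.

import numpy as np

def p_3pl(theta, a, b, c):
    # Probability of a correct response under the 3-PL model,
    # where c is the pseudo-guessing (lower-asymptote) parameter.
    # With c = 0 this reduces to the 2-PL; fixing a across items gives the 1-PL.
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

print(p_3pl(theta=0.0, a=1.2, b=0.4, c=0.2))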

4. Specifically, in a 2-parameter logistic model information is proportional to the squared discrimination term, whereas in the 3-parameter logistic model, it is a function of both discrimination and guessing.

Learn more?

Do you want to learn more about how to conduct online exams and assessments, and how Inspera Assessment can make your assessments accessible, secure, valid and reliable? No worries, we wrote a “Guide to Online Exams and Assessments”.

Topics: IRT, statistics, learning analytics, question analytics

Written by Niels Goet

Data Scientist, Ph.D - Inspera Assessment