Developing Educational Tests: How to Do It / ITech content

Researcher at the Laboratory of Measuring New Constructs and Test Design at the Center for Psychometrics and Measurement in Education at the Institute of Education of the Higher School of Economics. He is the head of the 4K project, which focuses on measuring critical thinking, creativity, communication, and cooperation. He is also a lecturer at the Institute of Education, where he teaches courses on psychometrics and measurement methodology in psychology and education. Specializes in the development of innovative methods for assessing educational outcomes and psychological characteristics.

In this article, we discuss how to develop educational tests that check subject-specific knowledge and skills. The principles described here also apply to tests that assess psychological qualities or soft skills. The rules for creating effective tests are universal: word the questions clearly, use a variety of task types, and ensure objective assessment. Our goal is to help you create high-quality tests that measure participants' knowledge and skills effectively.

From this article you will learn:

whether a multiple-choice test can check a student's reasoning skills;
whether a test can measure not just knowledge of facts but also comprehension of the educational material;
how difficult tasks differ from complex ones, and why a test can be difficult but is better not made complex;
which tasks are better to start with, easy or difficult ones;
what number of answer options is optimal;
how to check if a test works.

What you need to know before developing a test

Psychometricians define a test broadly, as any tool for measuring a characteristic. In this sense, role-playing games, essays, and portfolio assessment are also tests. Here we focus on the most common type, the one that the word "test" denotes in its narrow, everyday sense: a set of multiple-choice questions. Such tests are widely used in psychology and education to assess knowledge, skills, and personality traits.

Standardized tests in education have a controversial reputation. However, psychometricians continue to support this assessment method, arguing that multiple-choice tests are scalable, fair, and objective. This means that the same test can be used to assess an unlimited number of students, ensuring a level playing field for everyone. Importantly, test results are not influenced by external factors or subjective opinions.

However, any psychometrician will note that multiple-choice tests are not a universal tool. The choice of assessment method depends on the specific construct to be measured. In psychometrics, a construct is a mental property or ability that cannot be directly observed but can be assessed through external behavioral manifestations. This highlights the importance of selecting adequate assessment methods to obtain reliable results.


Multiple-choice tests are well suited to constructs that involve factual information and specific technical skills. They effectively check understanding of key concepts, such as how the division command works in a programming language. Such tests help identify knowledge levels and quickly assess abilities in a specific area.

However, multiple-choice tests are ineffective for assessing how students reason, interact with colleagues, and find practical solutions to complex situations. The more complex the skills to be assessed, the more adaptive and multifunctional the measuring instrument must be. Assessing interaction and analytical skills requires methods that take their specifics into account, so traditional tests are not sufficient for a comprehensive analysis of student proficiency.

Tests can only reveal certain types of knowledge. Every subject has basic facts that can be turned into multiple-choice questions. For example, you might ask in what year Christopher Columbus discovered America. Such questions check knowledge of basic facts, but they do not always reflect the depth of knowledge. A more comprehensive assessment requires a variety of question formats that call for critical thinking and analysis.

There are elements of knowledge that require more than memorization to master. For example, if we want to understand the events and phenomena that led to the discovery of America, multiple-choice questions will not be as effective. To deeply understand this topic, it is important to analyze historical contexts, evaluate the influence of various factors, and comprehend the consequences, which requires more complex teaching methods and understanding.

Every teacher strives for students to not just memorize facts but also to master the material on a deep level. However, measuring understanding remains a complex task today. Perhaps in the future, neuroscience will provide us with tools for monitoring the processes occurring in the brain of each student. Currently, psychometrics focuses on observable aspects and behavioral manifestations, but universal criteria for comprehension have not yet been developed.

In educational measurement that targets deeper, non-factual knowledge, the emphasis shifts from simple recall to the ability to interpret and analyze information. In this context, open-ended tasks, as well as computer simulations and games, prove more effective than multiple-choice tests. Such tools create a more flexible testing environment and allow a better assessment of students' real-world skills and knowledge.

Creating a Test: Practical Tips

If your goal is to assess the acquisition of factual knowledge or specific skills, a multiple-choice test is a good fit. Creating and distributing such a test does not require complex digital platforms. Tools such as Google Forms or Yandex Forms are sufficient for basic tasks. These services make it easy to build surveys and tests, so the assessment process stays simple and accessible.

In this section, we take a detailed look at the key aspects of creating a high-quality test. If you want to go deeper into this topic, we recommend the book by T. M. Haladyna and M. C. Rodriguez, "Developing and Validating Test Items" (Routledge, 2013), and other works by these authors. Unfortunately, this book is not available in Russian.

Students often feel tired by the end of testing. As a result, the last tasks do not always reflect their actual level of knowledge. This means that the duration of a test should be limited.

Short tests have low reliability. A student may give an incorrect answer due to inattention or, conversely, accidentally guess the correct option. Longer tests reduce the likelihood of such errors, since random errors can compensate for each other. Thus, the more questions in a test, the higher the likelihood of obtaining a reliable result. Reliable tests are an important tool for assessing knowledge and skills, so it is worth paying attention to their length and structure.
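The link between test length and reliability can be illustrated with the Spearman-Brown prediction formula, a standard result of classical test theory. The article itself does not name this formula, so treat the sketch below as one possible illustration. The starting reliability of 0.60 is an invented value, and the formula assumes that the added questions are of the same quality as the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after the test length is multiplied by length_factor.

    Assumes the added (or removed) questions behave like the existing ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical short quiz with reliability 0.60, doubled in length.
print(round(spearman_brown(0.60, 2.0), 2))  # 0.75
```

The gain shrinks with every further lengthening, which is one reason to weigh reliability against the fatigue described above.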


The optimal test length is determined from the time required to complete one task. This time varies with the complexity of the questions and can range from a few dozen seconds to five minutes. It is also important to consider the age of the students, as younger students may need more time to solve problems than older ones. When developing a test, aim for a balanced number of questions that holds students' attention and interest while still allowing an adequate assessment of their knowledge.

Children who have not yet reached adolescence should not be given a test longer than 20 minutes, unless a break is provided during the test.
For older teenagers, university students, and adults, it is better to start from the length of a typical class. For example, it is normal for a high school student to spend a 45-minute lesson on a test (or two lessons with a break in between), while university students can take an 80-minute test.
In adult continuing education, keep in mind that adults do not feel obliged to take any tests and need additional motivation. For example, you can promise individual feedback on the test results (and then be sure to provide it!).
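The arithmetic behind these limits can be sketched in a few lines. The function name, the reserve for instructions, and the 90 seconds per task are illustrative assumptions, not values from the article. Only the 20- and 45-minute limits come from the text.

```python
def max_items(total_minutes: float, seconds_per_item: float,
              reserve_minutes: float = 0.0) -> int:
    """Number of tasks that fit into the available time.

    reserve_minutes is time set aside for instructions and settling in.
    """
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    usable_seconds = (total_minutes - reserve_minutes) * 60
    return max(0, int(usable_seconds // seconds_per_item))


# A 45-minute lesson, 5 minutes reserved, about 90 seconds per task (assumed).
print(max_items(45, 90, reserve_minutes=5))  # 26

# A 20-minute test for younger children, about one minute per task (assumed).
print(max_items(20, 60))  # 20
```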

The golden rule is: the more time in the course devoted to a particular topic, the more questions will be on the final test. This is due to the fact that when developing a course, an emphasis is placed on the most significant topics, which implies their in-depth study and, accordingly, an increase in the number of questions to check the assimilation of the material. Therefore, it is important to understand that the distribution of hours has a direct impact on the structure of the final test and the level of student preparation.

It is recommended to ask at least three questions on each topic, unless they are too specific. This will balance out random errors and obtain more objective results. It is important to provide feedback not only on individual tasks, but also on the topic as a whole to ensure a deep understanding of the material. This approach promotes more effective learning and helps identify key aspects requiring additional attention.
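One way to turn these two rules into a test plan is sketched below: every topic first receives the minimum of three questions, and the remaining questions are shared in proportion to course hours. The topic names, hours, and the rounding rule (largest fractional part first) are assumptions made for illustration.

```python
def allocate_questions(hours_by_topic: dict[str, float], total_questions: int,
                       minimum: int = 3) -> dict[str, int]:
    """Split total_questions across topics in proportion to course hours."""
    topics = list(hours_by_topic)
    total_hours = sum(hours_by_topic.values())
    if total_hours <= 0:
        raise ValueError("course hours must sum to a positive number")
    if total_questions < minimum * len(topics):
        raise ValueError("not enough questions to give every topic the minimum")

    # Every topic gets the minimum; the rest is shared in proportion to hours.
    remaining = total_questions - minimum * len(topics)
    exact = {t: remaining * hours_by_topic[t] / total_hours for t in topics}
    plan = {t: minimum + int(exact[t]) for t in topics}

    # Questions lost to rounding go to the largest fractional parts first.
    leftover = total_questions - sum(plan.values())
    by_fraction = sorted(topics, key=lambda t: exact[t] - int(exact[t]), reverse=True)
    for topic in by_fraction[:leftover]:
        plan[topic] += 1
    return plan


# Hypothetical course: hours spent on each topic, 20 questions in the final test.
course_hours = {"fractions": 6, "decimals": 3, "percentages": 3}
print(allocate_questions(course_hours, 20))
# {'fractions': 8, 'decimals': 6, 'percentages': 6}
```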

In psychometrics, a task can be difficult without being complex. Difficulty is understood in the same way as in everyday language: successfully completing a difficult task requires a high level of knowledge on the topic, so only a few students will cope with it.

Complexity is a separate psychometric concept: it reflects the number of actions and cognitive operations required to solve a problem. Consider a mathematical example: dividing 0.219 by 0.365 is difficult but not complex, since only one operation is required. Thus, difficulty reflects how much knowledge and skill a task demands, while complexity is determined by the number of steps required to reach the result.

It is recommended to begin testing with simpler tasks, since stress levels are usually higher at the beginning, which can negatively affect results. If the test consists of thematic blocks, it is advisable to arrange the tasks in each of them in order of increasing difficulty - from easy to difficult. This approach contributes to a more accurate assessment of knowledge and reduces anxiety in participants.

Whether to group tasks into thematic blocks is a complex question. On the one hand, it helps when the test taker can focus on one topic at a time, which allows a more thorough analysis of their understanding and skills in that area. On the other hand, you may need to assess the test taker's ability to switch quickly between different tasks and problems. Mixing topics makes it possible to gauge adaptability, which is also important in today's dynamic world.


The testing method depends on the specific discipline and the objectives of the test. Ensuring a level playing field for all participants is key, allowing for comparable results. Testing should be organized so that each test-taker has access to the same resources and information, which promotes an objective assessment.

Dividing the test into blocks is an important practice, as it allows test-takers to recognize that the test has certain boundaries. In a computer-based testing environment, where it is impossible to scroll through the tasks and estimate how many questions remain, this becomes especially important. Furthermore, it is important to inform participants in advance of the time limits for answering questions so that they can manage their time wisely. This promotes more efficient testing and reduces stress levels in test-takers.

Today, the most common form of testing is tests similar to those used in the Unified State Exam, which offer four answer options. It is believed that the choice of exactly four options is due to the limitations of human working memory: it is believed that the average person can simultaneously retain approximately four elements in their mind. This explanation highlights the importance of designing tests that take cognitive characteristics into account, which contributes to more effective assessment of knowledge.

Cognitive psychologists consider this rationale unscientific. Most likely, the four answer options were chosen arbitrarily, and there is nothing biologically or psychologically determined about this number. Other formats are also possible, for example only three answer options, especially since coming up with more incorrect answers is often difficult.

Creating incorrect answer options is a complex psychometric art. These options, known as distractors, are designed to divert attention from the correct answer. Effective distractors must be logical and plausible, so that the test taker cannot easily identify the correct answer. High-quality incorrect answers require a thorough understanding of the topic and the specifics of the questions, making their development an important aspect of testing and assessment.

A key aspect of creating test questions is the need to formulate incorrect answers so that they appear plausible and are similar to the correct answer. This helps avoid confusion and increases participant engagement. For example, if a question asks "What year?", all answer options should represent dates within the same time range. This will prevent participants from easily eliminating incorrect options, making the test more challenging and engaging.

Incorrect answer options should not include the correct answer or part of it. If such an option does exist, it is necessary to clarify in the question that the test taker must select the most correct answer. This will help avoid confusion and ensure accurate assessment.

The highest level of skill is the analysis of typical student errors based on their incorrect answer options. This approach allows for more extensive and in-depth feedback. Instead of simply pointing out errors, we explore why the student chose that particular incorrect option. This promotes a better understanding of the material and improves the learning process, helping students avoid repeating the same mistakes in the future.
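A first step toward this kind of analysis is simply counting how often each option was chosen. The sketch below assumes a hypothetical question with options A to D, where A is correct, and ten invented responses. The interpretation in the comments is a rule of thumb, not a formal criterion.

```python
from collections import Counter


def option_shares(choices: list[str], options: list[str]) -> dict[str, float]:
    """Share of test takers who picked each answer option."""
    counts = Counter(choices)
    return {option: counts[option] / len(choices) for option in options}


# Invented responses to one question; "A" is the correct answer.
responses = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
shares = option_shares(responses, ["A", "B", "C", "D"])
print(shares)  # {'A': 0.6, 'B': 0.3, 'C': 0.1, 'D': 0.0}

# B attracts many wrong answers: it may reflect a typical student error
# worth addressing in feedback.
# D was never chosen: it is probably not plausible and does not work
# as a distractor.
```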

In addition to reliability, an important quality of any test is validity. According to the classical definition, validity is the ability of a test to measure exactly what it is intended to measure. The modern understanding of validity implies that the results of a valid test can be interpreted in accordance with the logic on which it was developed. Test validity plays a key role in ensuring its effectiveness and accuracy, as it ensures that the data obtained truly reflect the phenomena under study.

Sometimes the validity of results can be affected by how the test taker perceives the situation in the task. Even if their view differs from the generally accepted one, this does not necessarily indicate an error. It is important to consider the variety of interpretations that may arise during the testing process.

Let's consider an example from a critical thinking test created at the Higher School of Economics. This test is a simulated online environment in which the participant interacts with a bot. One of the main tasks is to obtain missing information to create a cake recipe. This approach helps assess critical analysis skills and the ability to ask the right questions, which are important aspects of learning and decision-making.

The test taker must ask the bot a specific question, such as, "How many eggs should I add?" However, people sometimes begin with a greeting, such as, "Hi, how are you?", and this is perfectly normal before asking about the recipe. If this is not taken into account when designing the test, such messages may be incorrectly scored as errors. This shows how important it is to formulate test items properly and to allow for natural communication.

One common complaint about tests is that they turn into a guessing game, which leads to the suggestion to increase the number of answer options. It may seem that with two options the probability of a correct answer is 50%. However, this is only true if the test consists of a single question with two answer options. When a test includes several questions, the probability of guessing all of them correctly falls quickly as the number of questions grows. Thus, adding answer options is not the only way to reduce the likelihood of guessing: adding questions does so as well.


Adding a second two-option question that contains no hints to the first means that the probabilities are multiplied. As a result, the chance of randomly guessing both correct answers drops to 25%. For a test of ten such questions, the probability of answering all of them correctly by chance is about 0.1%, which is practically zero.

Such a calculation is justified only in cases where the tests contain carefully formulated incorrect answers.
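The calculation can be reproduced in a few lines. The sketch assumes pure guessing on independent questions where every option is equally attractive, which is exactly the condition that carefully formulated distractors are meant to provide. The second function, the chance of reaching a pass mark by guessing, is an added illustration that uses the binomial distribution. The 7-out-of-10 threshold is hypothetical.

```python
from math import comb


def p_all_correct(n_questions: int, n_options: int) -> float:
    """Chance of guessing every question correctly."""
    return (1 / n_options) ** n_questions


def p_at_least(n_correct: int, n_questions: int, n_options: int) -> float:
    """Chance of guessing at least n_correct questions correctly (binomial)."""
    p = 1 / n_options
    return sum(
        comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(n_correct, n_questions + 1)
    )


print(p_all_correct(1, 2))   # 0.5
print(p_all_correct(2, 2))   # 0.25
print(p_all_correct(10, 2))  # 0.0009765625

# Hypothetical pass mark: at least 7 of 10 questions with four options each.
print(p_at_least(7, 10, 4))
```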

How to check if a test works

In the Master's program in psychometrics at the Institute of Education, students spend two academic years studying how to assess and verify the quality of tests. The focus is on psychometric theories, statistical methods, and the practical skills needed to analyze test data. Students master various approaches to test development and validation, which allows them to evaluate how well a test performs. The program trains qualified specialists capable of solving current problems in assessment and testing.

A test can be checked using qualitative or quantitative methods. The qualitative method involves an interview, during which the test developer presents tasks to a representative of the target group, observes their actions, and asks clarifying questions. This approach shows how clear each task is, what steps the test taker takes to solve it, and which aspects cause difficulty or, conversely, seem too simple. This analysis helps improve the test and make it more effective for future use.

A qualitative check aims to confirm that the tasks actually activate the necessary cognitive processes. This means that the test taker must not simply pick from the proposed options but actually solve the problem, for example a mathematical one. It is important that distractors do not contain elements of the correct answer and that all instructions are clear and understandable. This makes the assessment of test takers' knowledge and skills more accurate, which in turn increases the reliability of the results.

Evaluating test performance using quantitative methods is a key aspect of psychometrics. This assessment is carried out through statistical analysis, which requires approximately 100 observations to achieve reliable results.

It is not always possible to conduct a full assessment for each course, so interviews are often sufficient. However, quantitative assessment of test results becomes necessary when decisions about course admissions or certificate issuance are based on this data. This approach ensures objectivity and transparency of the process, which is especially important for educational institutions and course participants.

A quantitative check gives the developer the same information as a qualitative one. However, it also provides additional opportunities for analysis: it reveals questions and statements that do not do their job, as well as those that are redundant and do not affect the test results. This makes it possible to optimize the structure and content of the test.
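The article does not say which statistics are used for this. Two common classical indicators are the share of correct answers per question and the correlation between a question and the score on the rest of the test, and the sketch below computes both. The eight invented response rows only show the mechanics. A real check needs far more observations, as noted above.

```python
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / (var_x * var_y) ** 0.5


def item_analysis(responses: list[list[int]]) -> list[dict]:
    """responses: one row per test taker, 1 = correct, 0 = incorrect."""
    report = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # score without this item
        report.append({
            "item": i + 1,
            "share_correct": mean(item),
            "item_rest_r": pearson(item, rest),
        })
    return report


# Invented data: 8 test takers, 5 questions.
data = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
for row in item_analysis(data):
    print(row)
```

In this toy data, question 1 is answered correctly by everyone, so it does not affect the ranking of test takers. Question 5 correlates negatively with the rest of the test, a sign that it may not do its job and should be reviewed.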
