How Do We Create or Standardize a Psychometric Test?

By definition, a standardized test is one that is administered and scored in a consistent, or “standard”, manner. Such tests are designed so that the questions, the conditions of administration, the scoring procedures, and the interpretations stay the same for every test taker.

A standardized test can be composed of true-false items, multiple-choice items, authentic assessments or essays. It’s possible to shape any form of assessment into a standardized test. In psychometric assessments, questions are measured on scales,[1] and these too are often most valid when standardized after creation.

During creation, psychometric tests are scrutinized through reliability and validity checks. Norming, in turn, frames the standardization process.

A test is reliable if it produces similar results over time, across repeated administrations, or under similar circumstances.

For example, it’s reasonable to expect a line that measures five centimetres on one ruler to measure the same on a different one. The line itself does not change, and only a good ruler can ensure it reads five centimetres regardless of who measures it. In psychometric assessment, a reliable test is like that ruler: it produces stable results over time.

Over the years, scholars and researchers have developed multiple ways to check for reliability.[2] These include testing the same participants at different points in time (test-retest reliability) and presenting participants with different versions of the same test (parallel-forms reliability) to see how consistent the results are.
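As a rough illustration of the first approach, one common way to summarize test-retest consistency is the correlation between scores from two sittings. The sketch below assumes a Pearson correlation and uses made-up scores; the function name `pearson_r` and the data are hypothetical and not taken from any particular test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five participants, tested on two occasions.
first_sitting = [12, 18, 25, 30, 41]
second_sitting = [14, 17, 27, 29, 40]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 suggests that participants keep roughly the same relative standing across sittings. The same calculation could be applied to scores from two versions of a test.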

Suffice it to say that an assessment has to demonstrate good reliability before it can be considered valid.

Understanding Validity in Psychometric Tests

It is understandable to expect a test used in organizations to shed light on how a candidate would perform in a particular job. With this in mind, it is essential to reiterate the difference between reliability and validity, with the former being a prerequisite to the latter.[3]

Let’s consider a dart player. In repeated trials, the player consistently misses the mark by about two inches. This implies reliable aim: each shot hits the board in a region two inches from the target. Yet it is hard not to question the player’s validity as a professional, since hitting the bullseye is the aim of every professional dart player and this player’s peers manage it.

Reliability and validity go hand in hand, but reliability by no means guarantees validity. As our example showed, having the first without the second means being consistent, but consistently inaccurate. Validity, like reliability, can be tested.
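The dart example can be put into numbers. The sketch below is only an analogy under simple assumptions: each throw is reduced to a single made-up horizontal offset from the bullseye, the spread of the throws stands in for reliability, and the average miss stands in for (lack of) validity.

```python
from statistics import mean, pstdev

# Hypothetical horizontal distance (in inches) from the bullseye for ten throws.
# Positive values land to the right of the target.
throws = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]

spread = pstdev(throws)  # small spread: the throws cluster tightly (consistent aim)
bias = mean(throws)      # average offset far from 0: systematically off target

print(f"spread: {spread:.2f} in, average miss: {bias:.2f} in")
```

A tight spread combined with a large average miss is the pattern described above: highly consistent, but consistently off target.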

Why Norming is Essential to Psychometric Tests

Even with a test that is both reliable and valid, there remains the question of how to interpret its results. An assessment fails without quantifiable results, but as is often said, human beings are far from quantifiable.

It is hard to quantify competencies such as ethical integrity or teamwork in a vacuum; similarly, a score on a personality test may be meaningless without a guide to interpret it. Experts distinguish candidates from one another by comparing them to a standard: either each other (a relative standard) or an external criterion (an absolute standard).

The first way is to compare people against a population of interest, and this is what’s more commonly referred to as norming. The other way is to set a fixed standard against which you measure performance on the assessment, and to use that standard to make decisions. Either way, the test developer must define a cut-off score for hiring or any other decision that depends on the assessment.
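The difference between the two kinds of standard can be sketched in a few lines. This is a simplified illustration with made-up candidates and scores; the function names `pass_absolute` and `pass_relative` are hypothetical, and the relative standard is reduced to comparing candidates within one small group.

```python
# Hypothetical candidate scores on an assessment (names and numbers are made up).
scores = {"Asha": 48, "Ben": 55, "Chen": 62, "Dana": 51}

def pass_absolute(scores, cutoff):
    """Absolute standard: everyone at or above a fixed cut-off score passes."""
    return sorted(name for name, s in scores.items() if s >= cutoff)

def pass_relative(scores, top_n):
    """Relative standard: the top_n scorers pass, whatever their scores are."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:top_n])

print(pass_absolute(scores, cutoff=70))  # [] : nobody reaches the fixed standard
print(pass_relative(scores, top_n=2))    # ['Ben', 'Chen'] : the two best of this group
```

With the same scores, the two standards lead to different decisions, which is why the choice of cut-off matters.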

However, even setting a cut-off is a delicate ball game. If you think about it, picking the relatively best apple from a batch of rotten apples would still yield a rotten apple. How then would you ensure good results from a good test? To assess overall performance on psychometric tests, researchers use standardization samples: large samples of test takers who represent the population for whom the test is intended.

A representative sample means using a group of children when developing a test for children, and a group of adults when developing a test for adults. Depending on the population, samples are generally also made representative with respect to demographic factors such as age, gender, education and religion.[4]

When you score at the 94th percentile on a trait like extraversion, you know that you are more extraverted than 94% of the sample group from which the test makers derived the norms. On the other hand, if you scored 94% on a math test, it simply means that you answered 94 of every 100 questions correctly.
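The distinction between a percentile and a percentage can be made concrete with a short sketch. It assumes one simple definition of percentile rank (the share of the norm sample scoring strictly below a given score); published tests may use other conventions, and the sample data here is made up and far smaller than a real standardization sample.

```python
def percentile_rank(score, norm_sample):
    """Percentage of the norm sample scoring strictly below `score`."""
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)

def percent_correct(num_correct, num_questions):
    """Share of questions answered correctly, as a percentage."""
    return 100 * num_correct / num_questions

# Hypothetical extraversion raw scores from a tiny standardization sample.
norm_sample = [10, 12, 15, 15, 18, 20, 21, 23, 25, 30]

print(percentile_rank(24, norm_sample))  # 80.0: higher than 8 of the 10 people
print(percent_correct(47, 50))           # 94.0: 47 of 50 questions right
```

The first number only has meaning relative to the sample it was computed from; the second depends only on the test taker's own answers.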

Psychological constructs such as personality have no right or wrong answers associated with them, and therefore cannot be scored as a percentage of correct answers.[5] This is why academics and researchers alike resort to norming, among other methods, to make sense of scores on personality assessments.

With growing concerns over cost, convenience and other logistical challenges, technology-enabled assessments have also become popular over time. They streamline the process, reduce costs, increase efficiency, and allow employers to assess and analyse more data points than was previously thought possible.

Beyond knowing how psychometric tests are created and standardized, it is also imperative to understand how best to determine the quality of a psychometric test.

  2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
  3. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
  4. Coaley, K. An Introduction to Psychology, p. 243.