Section 5: Measuring Performance
How can we translate abstract organizational goals into tangible outcomes that can be easily measured? What types of metrics and performance indicators have worked well in specific organizations? What are the limits to a performance or outcome-based perspective in the public sector (or alternatively, what types of programs will be difficult to track)?
Validity and Reliability
Kimberlin, C. L., & Winterstein, A. G. (2008). Validity and reliability of measurement instruments used in research. Am J Health Syst Pharm, 65(23), 2276-84.
Author: Hanson, Keely; Editor: Hanson, Keely
The purpose of this article is to review issues related to the validity and reliability of measurement instruments used in research. The article provides background on why using valid and reliable tests or instruments to measure abstract constructs is a crucial component of research quality. Given that public organizations have many areas where metrics and performance indicators may be subjective in nature, this article explains how reliability and validity can be useful in assessing performance in order to mitigate subjectivity and effectively translate organizational goals into tangible outcomes.
Reliability and validity are highlighted as key indicators of the quality of a measuring instrument and are defined as follows:
• Reliability-- the consistency of a measuring instrument, evaluated through estimates of the stability of measures, the internal consistency of measurement instruments, and the interrater reliability of instrument scores.
• Validity-- the extent to which the interpretations of the results of a test are warranted, depending on the particular use the test is intended to serve.
Evaluating the quality of measures
Whether in healthcare, education, or beyond, many public sector organizations collect data that are subject to a high degree of subjective judgment or to other potential sources of measurement error. Managers and their research teams must control for known sources of error and report the reliability and validity of the measurements used.
Reduction of error is one of the main focuses when developing and validating an instrument. Reliability estimates are used to evaluate several properties of a measure. They evaluate “the stability of measures administered at different times to the same individuals or using the same standard (test–retest reliability).” They also evaluate “the equivalence of sets of items from the same test (internal consistency) or of different observers scoring a behavior or event using the same instrument (interrater reliability).” The specific aspects of reliability that public administrators can evaluate when assessing performance are stability, internal consistency, responsiveness, and interrater reliability.
- Stability of measurement, or test–retest reliability, is “determined by administering a test at two different points in time to the same individuals and determining the correlation or strength of association of the two sets of scores.” It essentially helps evaluators understand how consistent or stable the results of a test being administered are.
- When evaluating stability, the time interval between test administrations is critical to the assessment. The article explains that, ideally, the “interval between administrations should be long enough that values obtained from the second administration will not be affected by the previous measurement (e.g., a subject’s memory of responses to the first administration of a knowledge test, the clinical response to an invasive test procedure) but not so distant that learning or a change could alter the way subjects respond during the second administration.”
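The test–retest calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Pearson correlation is the measure of association and using made-up scores; the helper name `pearson_r` and the sample data are hypothetical and do not come from the article.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five respondents at two administrations.
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]

test_retest_r = pearson_r(time_1, time_2)
print(f"test-retest r = {test_retest_r:.2f}")
```

A value close to 1 suggests the instrument gives stable results across the two administrations; how high is "high enough" depends on the use and is not settled by this sketch.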
Internal consistency is a measure of how well different items on the same test assess the same construct or idea. This gives public managers the ability to understand how well an item of interest is being accounted for and how consistent their results are. The most widely used method for estimating internal consistency reliability is Cronbach’s alpha, which is “a function of the average intercorrelations of items and the number of items in the scale.”
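As a rough illustration of that description, the sketch below computes the standardized form of Cronbach’s alpha, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average correlation between pairs of items. This is one common form of the statistic (another form works from item variances), and the function names and sample responses are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def standardized_alpha(items):
    """Standardized Cronbach's alpha.

    items: list of k lists, one per item, each holding every
    respondent's score on that item in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Average correlation over every distinct pair of items.
    pair_rs = [pearson_r(a, b) for a, b in combinations(items, 2)]
    r_bar = sum(pair_rs) / len(pair_rs)
    return (k * r_bar) / (1 + (k - 1) * r_bar)


# Hypothetical responses: three survey items, five respondents each.
item_1 = [4, 3, 5, 2, 4]
item_2 = [5, 3, 4, 2, 4]
item_3 = [4, 2, 5, 3, 3]

alpha = standardized_alpha([item_1, item_2, item_3])
print(f"standardized alpha = {alpha:.2f}")
```

The formula shows both ingredients named in the quotation: alpha rises when items correlate more strongly with each other and when more items are added to the scale.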
Responsiveness is the ability of a measure to detect change over time in the construct of interest. Reliability is a crucial component of responsiveness, and responsive measures can be very useful in the public sector for identifying interventions that are effective in producing the changes of interest.
Interrater reliability helps illustrate the degree of consensus among observers. It is also known as interobserver agreement, and it can be important in ensuring that raters making subjective assessments are consistent. It “establishes the equivalence of ratings obtained with an instrument when used by different observers.” A reliable measure that involves judgments or ratings by observers will show consistency between different raters. To maintain the integrity of interrater reliability, no discussion or collaboration can occur among observers while they are giving ratings.
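One way to put a number on agreement between two raters is sketched below. It computes simple percent agreement and Cohen’s kappa, which adjusts agreement for what would be expected by chance. Kappa is brought in here as a commonly used statistic for illustration; the summary above does not name a specific one, and the rating categories and data are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of cases on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length rating lists")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions,
    # summed over categories (Counter returns 0 for unseen categories).
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical independent ratings of the same eight cases.
rater_a = ["meets", "meets", "below", "exceeds", "meets", "below", "meets", "exceeds"]
rater_b = ["meets", "below", "below", "exceeds", "meets", "below", "meets", "meets"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

The two lists are assumed to have been produced independently, in line with the requirement that observers not discuss or collaborate while rating.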
Validity is a very relevant construct for public administrators given the public nature of their work. It is known as “the extent to which the interpretations of the results of a test are warranted, which depend on the test’s intended use.” Evaluations in the public sector need to be mindful of validity, because it is “the extent to which an instrument measures what it purports to measure.” In evaluation, an instrument can be reliable without being valid. Public administrators have a responsibility to be transparent and are depended upon to show the utmost integrity in what they provide to the public; that is why validity emerges as extremely important, especially with respect to subjective constructs.
Whether it is in health, education, or other arenas of the public sector, there are many cases in performance evaluation that may require quantifying attributes that cannot be measured directly. These explanatory variables that are not observable are called hypothetical constructs. Hypothetical constructs “cannot be measured directly and can only be inferred from observations of specified behaviors or phenomena that are thought to be indicators of the presence of the construct. An operational definition of a construct links the conceptual or theoretical definition to more concrete indicators that have numbers applied to signify the “amount” of the construct.” How this construct is operationally defined and quantified is the core of the measurement.
Construct validity assesses how well the test or experiment measures the construct of interest. Evaluation of construct validity requires “examining the relationship of the measure being evaluated with variables known to be related or theoretically related to the construct measured by the instrument.”
Content validity is usually assessed based on the judgment of experts in the field. It addresses how well the items developed to operationalize a construct, “provide an adequate and representative sample of all the items that might measure the construct of interest.”
Criterion-related validity "provides evidence about how well scores on the new measure correlate with other measures of the same construct or very similar underlying constructs that theoretically should be related." Selecting an appropriate and meaningful criterion measure can be a challenge because the criterion a researcher would like to be able to predict is often too far ahead in time or too expensive to measure.
Selecting an existing instrument
Especially in the public sector, where assessments have long been utilized and are often required, evaluators should identify any existing instruments that measure the construct of interest before developing a new test or measure. Using an existing instrument can be more cost-effective and time-efficient. The article also highlights the following questions to ask when selecting an instrument:
1. Do instruments already exist that measure a construct the same or very similar to the one you wish to measure?
2. How well do the constructs in the instruments you have identified match the construct you have conceptually defined for your study?
3. Is the evidence of reliability and validity well established? Has the measure been evaluated using various types of reliability estimates?
4. In previous research, was there variability in scores with no floor or ceiling effects?
5. If the measure is to be used to evaluate health outcomes, effects of interventions, or changes over time, are there studies that establish the instrument’s responsiveness to change in the construct of interest?
6. Is the instrument in the public domain? If not, it will be necessary to obtain permission from the author for its use. Even though an instrument is published in the scientific literature, this does not automatically mean that it is in the public domain, and permission from the author and publisher may be required.
7. How expensive is it to use the instrument? A mail questionnaire costs less to administer than do telephone or face-to-face interviews.
8. If the instrument is administered by an interviewer or if the measure requires use of judges or experts, how much expertise or specific training is required to administer the instrument?
9. Will the instrument be acceptable to subjects? Does the test require invasive procedures?
Public managers interested in evaluation must pay close attention to how instruments for data collection are developed, and development should include pilot testing to determine reliability and validity. This will help ensure the quality of the research conducted.
One of the primary threats to quality assessments is the accuracy of the data collected. In the public sector, self-reported and/or secondary data sources are often used. Secondary data may have particular relevance for assessment in the public sector, where collecting such data is generally standard practice. However, the article highlights that it is imperative to “verify that the data set appropriately measures the variables required to answer the research questions.” Self-reported data can be particularly subject to social desirability bias, in which respondents answer in ways meant to create a favorable impression. Public managers should consider this when they assess or collect data, especially through surveys or questionnaires. One strategy for mitigating it when asking about the frequency of a behavior is to let the subject fill in the blank on an item with a clearly defined reference period. There are other ways to control for social desirability bias, and evaluators should draw on them when using self-reported measures.
In the public sector, there are many variables of interest and outcomes that are important that will inevitably be abstract in nature. Using tests or instruments that are valid and reliable to measure such constructs is a critical aspect of ensuring research quality, and becomes even more imperative when assessing services and outcomes within the context of the public sector.
Key Performance Indicators
USAID (1996). Performance Monitoring and Evaluation TIPS: Selecting Performance Indicators.
Author: Checksfield, Molly Wentworth; Editor: McCully, James I
The USAID Center for Development Information and Evaluation has developed criteria for creating performance indicators that determine a program's success in achieving its objectives. This methodology goes beyond a results statement, which states what objectives are hoped to be achieved, by providing a measure for determining whether the objective has been achieved. Indicators are generally quantitative measures but can also be qualitative observations. They define how performance will be measured along a scale or dimension, without specifying a level of achievement (targets are separate from indicators). These evaluations serve as extremely valuable parts of an organization's reflection process while it works toward its goals.
Being able to monitor performance is important for the wellbeing of an organization, as is the ability to measure how a program is progressing toward its intended goals and outcomes. Performance indicators are an indispensable management tool for making performance-based decisions about program strategies and activities.
However, many organizations have difficulty determining how to measure performance indicators most efficiently. Selecting appropriate and useful performance indicators requires careful thought, iterative refining, collaboration, and consensus-building. According to USAID, there are four steps in selecting proper performance indicators:
Step 1. Clarify the Results Statements
The first step in determining effective performance indicators is being able to clarify a common goal that the organization is striving for. This goal should be articulated as clearly and concisely as possible. It is imperative that all those working towards the goal are aware of the team’s objectives so that results can be measured most accurately.
When developing results statements, it is best to use precise language, as individuals may interpret broad statements differently.
Clarification of the type of change being sought is important as well, whether behavioral, situational, conditional, attitudinal, etc. Also, it is important to indicate whether the goal is absolute change, relative change, or no change.
Absolute change: creating something new
Relative change: “increases, decreases, improvements, strengthening or weakening in something that currently exists, but at a higher or lower level than is considered optimum” (USAID, p. 2)
No change: maintaining the status quo.
Proper identification of where change should occur is important as well; this is also known as the “unit of analysis.” Change can occur among individuals, groups, communities, regions, etc.
It is necessary to be clear about how different types of activities will lead to the intended results in order to set reasonable expectations for outcomes. Some activities will lead directly to the intended change, while certain others will produce it less directly. Therefore, how directly an activity is expected to produce change should also be considered.
Step 2. Develop a List of Possible Indicators
There are a variety of ways to evaluate outcomes, but it is advised to select indicators that are best suited for the type of results desired. One way to determine how to choose an indicator is to list the types of indicators possible to use in a given situation. Always keep the objective in mind, consult with the experts of that field, and utilize the experience of other units with similar indicators.
Inclusivity is also vital to the development of a robust list of performance indicators so that all perspectives are on the table when choosing the most appropriate levels of measurement.
Step 3. Assess Each Possible Indicator
The next step is to assess all possible indicators brainstormed in step 2. There are seven criteria to evaluate such indicators:
1) Performance indicators should measure the intended result as closely as possible. Direct measures are a more accurate representation of performance than indirect ones.
2) Objective measures reduce ambiguity and levels of interpretation.
3) Performance measures should most adequately measure the result in question.
4) Descriptive, quantitative analyses are often preferred over qualitative measures where possible because they yield more straightforward results. The type of indicator (quantitative or qualitative) should be determined on a case-by-case basis.
5) Disaggregated data indicate differences in how the outcomes affect groups. It is best for management to be able to measure how performance indicators affect a variety of different groups, rather than the group as a whole.
6) Practicality of an indicator is important because the cost and time necessary to obtain data should be used in assessing whether an indicator is acceptable for achieving the goal.
7) Reliability of the data is an important factor in choosing a performance indicator, because decisions will be based on those data.
A helpful way to choose performance indicators is to create a spreadsheet to assess each indicator by rating each criterion from 1-5 and then determine best fit based on the overall score of each indicator.
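The spreadsheet approach can also be sketched in code. The example below rates each candidate indicator from 1 to 5 on the seven criteria and ranks candidates by total score. The criterion labels are shorthand for the seven items listed above, and the candidate indicators and their ratings are hypothetical; an unweighted sum is assumed because the text mentions only an overall score.

```python
# Shorthand labels for the seven assessment criteria listed above.
CRITERIA = [
    "direct",
    "objective",
    "adequate",
    "quantitative",
    "disaggregated",
    "practical",
    "reliable",
]


def total_score(ratings):
    """Sum the 1-5 ratings for one indicator, checking they are complete."""
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for criterion in CRITERIA:
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"rating for {criterion} must be between 1 and 5")
    return sum(ratings[c] for c in CRITERIA)


def rank_indicators(candidates):
    """Return (indicator name, total score) pairs, highest score first."""
    scored = [(name, total_score(ratings)) for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate indicators with made-up ratings, in CRITERIA order.
candidates = {
    "Share of households with piped water": dict(zip(CRITERIA, [5, 4, 4, 5, 3, 3, 4])),
    "Number of community meetings held": dict(zip(CRITERIA, [2, 5, 2, 5, 2, 5, 4])),
    "Resident satisfaction rating": dict(zip(CRITERIA, [3, 2, 3, 3, 4, 2, 3])),
}

ranking = rank_indicators(candidates)
for name, score in ranking:
    print(f"{score:>2}  {name}")
```

A team that considers some criteria more important than others could weight the ratings before summing; that would be a design choice of the team rather than something the summary prescribes.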
Step 4. Select the “Best” Performance Indicators
The final step is to narrow the performance indicator list down to those of best fit based on the intended outcome. The cost and practicality of each performance indicator should be taken into consideration when determining the best measurements of performance.
Conclusion
It is important for an organization to brainstorm which performance indicators are best suited to measure success so that goals are realized most efficiently. The data collected through the performance indicators measure the progress of an organization and are therefore vital to its mission and life.
SMART Criteria for Performance Measurement.
Author: Rodriguez Ranf, Daniela; Editor: Gobbo, Andre Francis
The SMART Criteria article is about how to evaluate and develop clear and useful objectives. SMART is an acronym for five key criteria for judging objectives: Specific, Measurable, Achievable, Relevant, and Time-bound (note that different sources substitute other words, such as attainable for achievable or realistic for relevant). The development of the SMART criteria has most commonly been credited to Peter Drucker’s management by objectives (MBO) concept, also known as management by results (MBR). This concept defines a process within organizations in which management and employees agree on objectives and goals and understand what they need to do to achieve them. The SMART criteria do not imply that all objectives must be quantified at all levels of management. The utility of the SMART criteria should be seen as a combination of the objective and the action plan.
DEVELOPING SMART GOALS
The following characteristics of S.M.A.R.T goals are described by Paul J. Meyer in Attitude is Everything.
Specific: This first criterion focuses on scope. It stresses the need to be specific when developing objectives or goals in order to avoid misinterpretations, and to provide as much clarity and direction as possible. A good test to see if a goal is specific enough is to apply the five “W” questions:
- What: What do you want to accomplish?
- Why: Provide specific reasons, purpose or benefits of accomplishing the goal.
- Who: Who is involved?
- Where: Identify a location.
- Which: Identify requirements and constraints.
Measurable: The second criterion focuses on the need for standards for measuring progress. These are necessary for a team to know that it is working toward the successful completion of a goal. Measurement helps a team stay on track, develop benchmarks, meet deadlines, and evaluate progress. Indicators should be quantifiable. Questions to ask to see if a goal is measurable include:
- How much?
- How many?
- How will I know when it is accomplished?
Achievable: The third criterion focuses on the need for goals to be realistic, attainable and within reach. It highlights the importance of being aware of the team’s capacity and resourcefulness. The article explains that “when you identify goals that are most important to you, you begin to figure out ways you can make them come true. You develop the attitudes, abilities, skills and financial capacity to reach them. The theory states that an attainable goal may cause goal-setters to identify previously overlooked opportunities to bring themselves closer to the achievement of their goals.” Questions to ask when trying to evaluate whether a goal is achievable include:
- How can the goal be accomplished?
- How realistic is the goal based on other constraints?
Relevant: The fourth criterion focuses on identifying whether the goal matters to the mission of the organization, program, or department. Goals need to be relevant and important to supervisors, bosses, the team, and the organization. A common understanding of relevance and why goals are important can unify teams and drive progress. It is important for goals to be relevant, but it is also crucial for them to be aligned with other goals so that all of the goals are pursued in a coordinated and cohesive way. Answering yes to the following questions helps evaluate whether a goal is relevant:
- Does this seem worthwhile?
- Is this the right time?
- Does this match our other efforts/needs?
- Are you the right person?
- Is it applicable in the current socio-economic environment?
Time-bound: The fifth criterion focuses on having a clear time-frame and target dates. Setting deadlines helps the team prioritize and structure resources and time in order to complete tasks. The importance of this criterion is also to establish a sense of urgency and a clear end-date in order to prevent goals from being sidetracked by day-to-day tasks. Questions to ask when evaluating whether a goal is time-bound include:
- What can I do six months from now?
- What can I do six weeks from now?
- What can I do today?
Other authors have added additional letters to SMART. Here are some examples:
- Evaluated and reviewed
- Evaluate consistently and recognize mastery
- Trackable and agreed
- Realistic and relevant ('Realistic' refers to something that can be done given the available resources. 'Relevant' refers to fit with the bigger picture and vision.)