Data Analysis

Exploratory Data Analysis

Four instruments were used to evaluate the reviewed OER: IMET, EQuIP, Achieve OER (selected scales), and Reviewer Comments. A fifth instrument, the CCSS Worksheet, provided foundational information for the other four, but its results were not reported. As noted earlier, the IMET rubric was not used in the ELA review because of the unit-level nature of the ELA materials.

Each instrument had one or more scales, each composed of one or more items. For example, the EQuIP rubric for math had four scales: Alignment, Key Shifts, Instructional Support, and Assessment. Each of those scales contained from three to nine questions. Data was aggregated at the scale level, as sketched below.
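A minimal sketch of that scale-level aggregation, assuming a long-format table of item scores already converted to the 0-3 values described below (the column names reviewer, scale, and score are illustrative, not the actual field names in the submitted forms):

    import pandas as pd

    # Hypothetical item-level records: one row per reviewer, item, and converted score.
    items = pd.DataFrame({
        "reviewer": [9, 9, 9, 5, 5, 5],
        "scale": ["Alignment", "Alignment", "Key Shifts",
                  "Alignment", "Alignment", "Key Shifts"],
        "score": [3, 2, 2, 1, 2, 3],
    })

    # Aggregate item scores up to the scale level: mean converted score
    # for each reviewer on each scale.
    scale_scores = items.groupby(["reviewer", "scale"], as_index=False)["score"].mean()
    print(scale_scores)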

The Likert ratings on the rubrics were converted to ordinal values, as shown below.

Achieve OER Ratings
  Superior                    3
  Strong                      2
  Limited                     1
  Very Weak                   0
  Not Applicable              -  (not included in analysis)

IMET Ratings
  Strongly agree              3
  Agree                       2
  Disagree                    1
  Strongly disagree           0

Reviewer Comments: Amount of Work Required Ratings
  Extreme                     3
  Moderate                    2
  Minor                       1
  None                        0

EQuIP Scale Ratings
  Most to all criteria met    3
  Many criteria met           2
  Some criteria met           1
  Does not meet criteria      0

EQuIP Overall Ratings
  Exemplar                    11-12
  Exemplar if Improved        8-10
  Revision Needed             3-7
  Not Ready for Review        0-2

Since the EQuIP Overall Rating scores had unequal intervals between ratings, we did not convert these values to a 0–3 point scale. These scores appear as a separate reporting point and are not considered in any comparison charts showing average scores.
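The conversion itself amounts to a label-to-number lookup. A minimal sketch is shown below; the dictionary names and the convert helper are illustrative rather than the actual processing code, the "Not Applicable" rating is mapped to a missing value so it drops out of averages, and the EQuIP Overall Ratings are deliberately left out because they are not converted:

    import pandas as pd

    # Ordinal conversion tables for the 0-3 point scales described above.
    ACHIEVE_OER = {"Superior": 3, "Strong": 2, "Limited": 1, "Very Weak": 0,
                   "Not Applicable": None}  # None -> excluded from averages
    IMET = {"Strongly agree": 3, "Agree": 2, "Disagree": 1, "Strongly disagree": 0}
    WORK_REQUIRED = {"Extreme": 3, "Moderate": 2, "Minor": 1, "None": 0}
    EQUIP_SCALE = {"Most to all criteria met": 3, "Many criteria met": 2,
                   "Some criteria met": 1, "Does not meet criteria": 0}
    # EQuIP Overall Ratings are reported separately and are not converted.

    def convert(ratings, table):
        """Map Likert labels to ordinal values; unmapped labels become missing."""
        return pd.Series(ratings).map(table)

    print(convert(["Superior", "Not Applicable", "Limited"], ACHIEVE_OER))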

Data was collected using PDF forms that were electronically submitted to OSPI staff. During the review collection process, ratings were recorded using the conversion tables shown above. The results were compiled into data sets that were then cleaned to use consistent references for unit titles, developers, and other metadata. Note that while some binary data (worksheet check marks) was collected to help reviewers assess the scored items, none of the worksheet check mark data was included in the analysis of average scores. Instead, this “How would you use this resource” data appears as a separate chart (see Figure 4).
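A sketch of the kind of metadata cleaning described here is shown below; the developer column and the alias table are hypothetical stand-ins, not the actual field names or values in the submitted files:

    import pandas as pd

    # Hypothetical alias table mapping the variants reviewers typed to one canonical name.
    DEVELOPER_ALIASES = {
        "engage ny": "EngageNY",
        "engageny": "EngageNY",
        "ck-12": "CK12",
        "ck12": "CK12",
    }

    def clean_metadata(df: pd.DataFrame) -> pd.DataFrame:
        """Normalize whitespace and case, then apply canonical developer names."""
        out = df.copy()
        key = out["developer"].str.strip().str.lower()
        out["developer"] = key.map(DEVELOPER_ALIASES).fillna(out["developer"].str.strip())
        return out

    print(clean_metadata(pd.DataFrame({"developer": [" Engage NY", "ck-12", "Saylor"]})))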

The scope of the data analysis did not involve comparing instructional materials to each other using a combination of all scores and all rubrics. Rather, data was compiled into charts for each unit or course with some limited comparisons between the resources based upon individual items or scales.

An independent review of the data was conducted post-hoc to ensure that the data cleaning and organization steps did not introduce errors. Approximately 10% of the data was selected from the raw submitted files and compared to the final consolidated data set. No errors were detected.
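The audit could be reproduced with a sketch like the following, where raw and final are hypothetical data frames keyed by reviewer, resource, and item, and score holds the converted rating:

    import pandas as pd

    def audit_sample(raw: pd.DataFrame, final: pd.DataFrame, frac: float = 0.10):
        """Draw roughly 10% of the raw submitted records and compare each score to
        the consolidated data set; return any rows that disagree."""
        keys = ["reviewer", "resource", "item"]
        sample = raw.sample(frac=frac, random_state=1)
        merged = sample.merge(final, on=keys, suffixes=("_raw", "_final"))
        return merged[merged["score_raw"] != merged["score_final"]]

    # An empty result for the audited slice corresponds to "no errors detected."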

Inter-rater reliability was addressed throughout the data collection process. The reviewers received ongoing training and guidance on standardizing their answers based upon evidence in the text and the detailed instructions found within each of the rubrics. When all the data was submitted for a particular unit or course, a quick analysis of the individual ratings for each of the rubrics was performed. In the instances where there was a difference of more than two points for an individual item, the reviewers who rated that product were given the opportunity to discuss their conclusions and make adjustments as necessary. They were also given clear feedback that they could retain their existing score if they wished.
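A sketch of that quick spread check, assuming the converted scores sit in a long table with one row per reviewer and item (the column names are illustrative):

    import pandas as pd

    def flag_disagreements(scores: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
        """Flag resource/item combinations where the gap between the highest and
        lowest reviewer score exceeds the threshold (here, more than two points)."""
        spread = (scores.groupby(["resource", "item"])["score"]
                        .agg(["min", "max"]))
        spread["spread"] = spread["max"] - spread["min"]
        return spread[spread["spread"] > threshold].reset_index()

    # Flagged items were returned to the reviewers for discussion; scores could be
    # revised or retained at each reviewer's discretion.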

Mathematics

There were seven full mathematics courses reviewed. Five were Geometry, one was Integrated Math 2, and one was Algebra 1. Each course was assigned to four independent reviewers. As capacity allowed, some resources were examined by five reviewers. CK12 Geometry Concepts, CK12 Honors Geometry Concepts, and EngageNY Geometry each received five independent reviews. In total, there were 31 reviews.

ELA

There were 20 ELA units reviewed. Each unit was assigned to four independent reviewers. Two resources, To Kill a Mockingbird Historical Perspective and NYC Department of Education Speeches, received three reviews. One resource, Saylor Unit 3 Anthem, received five reviews. In total, there were 79 reviews.

Though many of the full courses and units reviewed in this process were crafted to address the CCSS, several of the resources pre-date the CCSS. Thus, the review process compared these materials against target standards that their developers were not aiming for when the materials were created. In those instances, we noticed much higher variation in reviewer scores. Though scores remained within acceptable ranges of inter-rater reliability, judging how well these legacy resources aligned with the new standards was more challenging and depended more on each reviewer’s interpretation of the resource’s intent.

Testing Reviewer Bias - Mathematics

Figure 26 shows the average score given by each reviewer, sorted in increasing order, with a 95% confidence interval for each reviewer’s mean score. There do appear to be some slight differences between reviewers, but these may simply be due to chance; some reviewers may have been assigned better aligned curriculum, while others may have reviewed only less aligned curriculum.


Figure 26. Math reviewer average scores with a 95% confidence interval.
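The per-reviewer means and 95% confidence intervals plotted in Figure 26 (and in Figure 27 for ELA) can be computed with a standard t-based interval. The sketch below assumes the same hypothetical long-format score table as above, not the report's actual scripts:

    import pandas as pd
    from scipy import stats

    def reviewer_intervals(scores: pd.DataFrame) -> pd.DataFrame:
        """Mean score and 95% t-based confidence interval for each reviewer."""
        rows = []
        for reviewer, grp in scores.groupby("reviewer"):
            x = grp["score"].astype(float)
            mean = x.mean()
            sem = stats.sem(x)  # standard error of the mean
            lo, hi = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)
            rows.append({"reviewer": reviewer, "mean": mean,
                         "ci_low": lo, "ci_high": hi})
        # Sort in increasing order of mean score, as in the figures.
        return pd.DataFrame(rows).sort_values("mean")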

In order to test whether any reviewer had a tendency to over- or under-rate, we performed a t-test comparing each reviewer’s average score to the entire sample, testing whether the reviewer tended to score away from the mean. The results are shown in Table 1. Since we are performing tests for eight reviewers, it is important to adjust for multiple comparisons to avoid declaring a difference significant that could plausibly arise by chance when drawing eight means from the same distribution. The table gives the adjusted significance level, calculated using the Holm-Bonferroni method, in which the ordered p-values are compared to the nominal significance level (0.05) divided by the number of tests remaining; as soon as one test is deemed insignificant, the remaining tests are as well. In this case, even the smallest p-value does not fall below its corresponding adjusted significance level, 0.05/8, so we can conclude that there is no evidence of reviewer bias.

Table 1. t-test results for reviewer bias
Reviewer    p-value    Adjusted significance level
9           0.1056     0.0062
5           0.2129     0.0071
1           0.2794     0.0083
8           0.2932     0.0100
3           0.5177     0.0125
10          0.6196     0.0167
7           0.6418     0.0250
4           0.6739     0.0500
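The report does not include the test code, but one plausible reading of the procedure is a one-sample t-test of each reviewer's converted scores against the grand mean, followed by the step-down comparison of ordered p-values to 0.05 divided by the number of tests remaining. A sketch under that assumption, again with hypothetical column names:

    import pandas as pd
    from scipy import stats

    def reviewer_bias_tests(scores: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
        """One-sample t-test of each reviewer's scores against the grand mean,
        with step-down adjusted significance levels (alpha / tests remaining)."""
        grand_mean = scores["score"].mean()
        rows = []
        for reviewer, grp in scores.groupby("reviewer"):
            _, p_value = stats.ttest_1samp(grp["score"], popmean=grand_mean)
            rows.append({"reviewer": reviewer, "p_value": p_value})
        out = pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)
        n = len(out)
        # Smallest p-value is compared to alpha/n, the next to alpha/(n-1), and so on.
        out["adjusted_level"] = [alpha / (n - i) for i in range(n)]
        # Step down: once a test fails to beat its adjusted level, stop rejecting.
        significant = []
        rejecting = True
        for p, level in zip(out["p_value"], out["adjusted_level"]):
            rejecting = rejecting and (p < level)
            significant.append(rejecting)
        out["significant"] = significant
        return out

With eight reviewers this yields adjusted levels of 0.05/8 ≈ 0.0062 through 0.05/1 = 0.0500, matching Table 1; with ten reviewers it yields 0.05/10 = 0.0050 through 0.0500, matching Table 2.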

Testing Reviewer Bias - ELA

Figure 27 shows the average score given by each reviewer, sorted in increasing order, with a 95% confidence interval for each reviewer’s mean score. There do appear to be some slight differences between reviewers, but these may simply be due to chance; some reviewers may have been assigned better aligned resources, while others may have reviewed only less aligned resources.

In order to test whether any reviewer had a tendency to over- or under-rate, we performed a t-test comparing each reviewer’s average score to the entire sample, testing whether the reviewer tended to score away from the mean. The results are shown in Table 2. Since we are performing tests for 10 reviewers, it is important to adjust for multiple comparisons to avoid declaring a difference significant that could plausibly arise by chance when drawing 10 means from the same distribution. The table gives the adjusted significance level, calculated using the Holm-Bonferroni method, in which the ordered p-values are compared to the nominal significance level (0.05) divided by the number of tests remaining; as soon as one test is deemed insignificant, the remaining tests are as well. In this case, even the smallest p-value does not fall below its corresponding adjusted significance level, 0.05/10, so we can conclude that there is no evidence of reviewer bias.


Figure 27. ELA reviewer average scores with a 95% confidence interval.

Table 2. t-test results for reviewer bias
Reviewer    p-value    Adjusted significance level
12          0.2577     0.0050
20          0.3090     0.0056
16          0.3425     0.0062
11          0.3772     0.0071
14          0.4654     0.0083
19          0.5773     0.0100
15          0.7129     0.0125
17          0.7487     0.0167
18          0.9108     0.0250
13          0.9249     0.0500