Data Analysis

Exploratory Data Analysis

Four instruments were used to evaluate the reviewed OER: IMET, EQuIP, Achieve OER (selected scales), and Reviewer Comments. A fifth instrument, the CCSS Worksheet, helped provide foundational information for the other four, but its results were not reported. As noted earlier, the IMET rubric was not used in the ELA review due to the unit-level nature of the ELA materials.

Each instrument had one or more scales, each composed of one or more items. For example, the EQuIP rubric for math had four scales: Alignment, Key Shifts, Instructional Support, and Assessment. Each of those scales had from three to nine questions. Data was aggregated at the scale level.
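To make the aggregation step concrete, the sketch below averages item scores within each scale for each reviewer of a unit. It is a minimal illustration assuming a long-format layout; the column names and sample values are placeholders, not the project's actual export schema.

import pandas as pd

# Illustrative long-format review data. Column names and values are
# placeholders, not the project's actual export schema.
reviews = pd.DataFrame({
    "unit":     ["Unit A"] * 6,
    "reviewer": [1, 1, 1, 2, 2, 2],
    "scale":    ["Alignment", "Alignment", "Key Shifts",
                 "Alignment", "Alignment", "Key Shifts"],
    "score":    [3, 2, 2, 3, 3, 1],
})

# Aggregate at the scale level: average each reviewer's item scores
# within a scale for a given unit.
scale_scores = (
    reviews.groupby(["unit", "reviewer", "scale"], as_index=False)["score"]
           .mean()
)
print(scale_scores)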

The Likert-style ratings on the rubrics were converted to ordinal values, as shown below.

Achieve OER Ratings
Superior: 3
Strong: 2
Limited: 1
Very Weak: 0

IMET Ratings
Strongly agree: 3
Agree: 2
Disagree: 1
Strongly disagree: 0

Reviewer Comments: Amount of Work Required Ratings
Extreme: 3
Moderate: 2
Minor: 1
None: 0

EQuIP Scale Ratings
Most to all criteria met: 3
Many criteria met: 2
Some criteria met: 1
Does not meet criteria: 0

EQuIP Overall Ratings
Exemplar: 11-12
Exemplar if Improved: 8-10
Revision Needed: 3-7
Not Ready for Review: 0-2
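For illustration, the conversion tables above can be written as simple lookup maps applied during data processing. This is a minimal sketch; the label strings are assumed to match the wording used on the review forms.

# Lookup maps for the rating conversions listed above.
# Label spellings are assumed to match the review forms.
ACHIEVE_OER = {"Superior": 3, "Strong": 2, "Limited": 1, "Very Weak": 0}
IMET = {"Strongly agree": 3, "Agree": 2, "Disagree": 1, "Strongly disagree": 0}
WORK_REQUIRED = {"Extreme": 3, "Moderate": 2, "Minor": 1, "None": 0}
EQUIP_SCALE = {
    "Most to all criteria met": 3,
    "Many criteria met": 2,
    "Some criteria met": 1,
    "Does not meet criteria": 0,
}
# EQuIP Overall Ratings are reported separately and are not converted.

def to_ordinal(table: dict, label: str) -> int:
    """Convert a rubric label to its ordinal value."""
    return table[label.strip()]

print(to_ordinal(IMET, "Agree"))  # prints 2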

Since the EQuIP Overall Ratings had unequal intervals between rating categories, we did not convert these values to the 0–3 point scale. These scores appear as a separate reporting point and are not included in any comparison charts showing average scores.

Data was collected using an online form and recorded using the conversion tables shown above during the review collection process. The results were exported to a spreadsheet and compiled into data sets, which were then cleaned to use consistent references for unit titles, developers, and other metadata. Note that while some binary data (worksheet check marks) was collected to help reviewers assess the scored items, none of the worksheet check mark data was included in the analysis of average scores. Instead, this “How would you use this resource” data appears as a separate chart (see Figure 7).
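A minimal sketch of the cleaning step is shown below, assuming a hypothetical lookup of inconsistent unit-title spellings; the actual titles, developers, and column names in the exported spreadsheet differ.

import pandas as pd

# Hypothetical mapping of inconsistent unit-title spellings to a
# canonical form; the real titles and developers are different.
TITLE_FIXES = {
    "grade 6 ratios & rates": "Grade 6: Ratios and Rates",
    "Grade 6 - Ratios and Rates": "Grade 6: Ratios and Rates",
}

def clean_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize unit titles and trim developer names so that all
    reviews of the same resource share consistent metadata."""
    out = df.copy()
    out["unit_title"] = out["unit_title"].str.strip().replace(TITLE_FIXES)
    out["developer"] = out["developer"].str.strip()
    return out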

The scope of the data analysis did not involve comparing instructional materials to one another using a combination of all scores and all rubrics. Rather, data was compiled into charts for each unit or course, with some limited comparisons between resources based upon individual items or scales.

An independent review of the data was conducted post hoc to ensure that the data cleaning and organization steps did not introduce errors. Approximately 10% of the data was selected from the raw submitted files and compared to the final consolidated data set. No errors were detected.
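One way such a spot check could be implemented is sketched below: draw roughly 10% of the raw rows and compare their scores to the consolidated data set. The key and score column names are assumptions for illustration, and the actual verification procedure may have differed.

import pandas as pd

def spot_check(raw: pd.DataFrame, final: pd.DataFrame,
               keys=("unit_title", "reviewer", "item"),
               frac=0.10, seed=0) -> pd.DataFrame:
    """Compare a ~10% sample of raw submissions against the consolidated
    data set and return any rows whose scores do not match."""
    sample = raw.sample(frac=frac, random_state=seed)
    merged = sample.merge(final, on=list(keys), suffixes=("_raw", "_final"))
    return merged[merged["score_raw"] != merged["score_final"]]

# An empty result indicates no discrepancies in the sampled rows.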

Inter-rater reliability was addressed throughout the data collection process. The reviewers received ongoing training and guidance on standardizing their answers based upon evidence in the text and the detailed instructions found within each of the rubrics. When all the data had been submitted for a particular unit or course, a quick analysis of the individual ratings for each of the rubrics was performed. Where ratings for an individual item differed by more than two points, the reviewers who rated that product were given the opportunity to discuss their conclusions and make adjustments as necessary; they were also told clearly that they could retain their existing scores if they wished.
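The sketch below shows one way the more-than-two-point check could be run once all reviews for a unit or course were in; the column names are assumptions for illustration.

import pandas as pd

def flag_discrepancies(scores: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
    """Flag items where reviewers of the same unit differ by more than
    `threshold` points, so those reviewers can discuss their ratings."""
    spread = (
        scores.groupby(["unit_title", "rubric", "item"])["score"]
              .agg(lambda s: s.max() - s.min())
              .rename("spread")
              .reset_index()
    )
    return spread[spread["spread"] > threshold]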

Mathematics

There were ten full mathematics courses reviewed. Four were Grade 6, three were Grade 7, and three were Grade 8. Each course was assigned to four independent reviewers. In total, there were 40 reviews.

ELA

There were 20 ELA units reviewed. Each unit was assigned to four independent reviewers. In total, there were 80 reviews.

Though many of the full courses and units reviewed in this process were crafted to address the CCSS, several of the resources pre-date the CCSS. Thus, the review process compared these materials against target standards that their developers were not originally aiming for when the materials were created. In those instances, we noticed much higher variation in reviewer scores. Though still within acceptable ranges of inter-rater reliability, judging how well these legacy resources aligned with the new standards was more challenging and more dependent on each reviewer's interpretation of the resource's intent.

Testing Reviewer Bias - Mathematics

A technical analysis was performed to check for potential reviewer bias, that is, a tendency for a reviewer to systematically over- or under-rate the materials reviewed. The results show no evidence of reviewer bias in the data. Figure 31 shows the mean score given by each reviewer, sorted in increasing order, with a 95% confidence interval for each reviewer's mean score. There are slight differences between reviewer means, but these may simply be due to chance: some reviewers may have been assigned stronger materials, while others may have reviewed weaker ones.


Figure 31. Mean score by reviewer with a 95% confidence interval.
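For reference, a t-based 95% confidence interval for one reviewer's mean score can be computed as sketched below; the example ratings are made up, and the exact interval construction used for Figure 31 may differ.

import numpy as np
from scipy import stats

def reviewer_ci(scores: np.ndarray, confidence: float = 0.95):
    """Return the mean and a t-based confidence interval for one
    reviewer's scores."""
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return mean, (mean - half_width, mean + half_width)

# Example with made-up ratings on the 0-3 scale.
print(reviewer_ci(np.array([2, 3, 2, 1, 3, 2, 2, 3])))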

A t-test was performed to test whether any reviewer had a tendency to over- or under-rate. The t-test compared each reviewer’s average score to the entire sample to test whether the reviewer tended to score away from the overall mean. The results are shown in Table 1. Since the test is performed for 10 reviewers, it is important to adjust for multiple comparisons to avoid declaring a difference significant that could easily arise by chance when drawing 10 means from the same distribution. The table gives the adjusted significance level, calculated using a step-down Bonferroni (Holm) adjustment, in which the ordered p-values are compared to the nominal significance level (0.05) divided by the number of tests remaining; as soon as one test is deemed not significant, all remaining tests are as well. In this case, even the smallest p-value does not fall below its corresponding adjusted significance level of 0.05/10 = 0.005, so there is no evidence of reviewer bias.

Table 1. t-test results for reviewer bias
Reviewer   p-value   Adjusted significance level
1          0.1602    0.0050
6          0.4175    0.0056
9          0.4518    0.0062
5          0.4607    0.0071
10         0.5446    0.0083
2          0.6465    0.0100
7          0.7423    0.0125
3          0.8683    0.0167
4          0.9152    0.0250
8          0.9813    0.0500
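The sketch below illustrates the testing procedure described above under simplifying assumptions: a one-sample t-test of each reviewer's scores against the overall mean, followed by a step-down comparison of the ordered p-values to 0.05 divided by the number of tests remaining. The data structure is a placeholder, and the original analysis may have differed in detail.

import numpy as np
from scipy import stats

def reviewer_bias_tests(scores_by_reviewer: dict, alpha: float = 0.05):
    """One-sample t-test of each reviewer's scores against the overall
    mean, with a step-down (Holm-style) adjustment of the ordered p-values."""
    overall_mean = np.mean(np.concatenate(list(scores_by_reviewer.values())))
    pvals = {
        reviewer: stats.ttest_1samp(scores, overall_mean).pvalue
        for reviewer, scores in scores_by_reviewer.items()
    }
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ordered)
    results = []
    reject = True
    for i, (reviewer, p) in enumerate(ordered):
        threshold = alpha / (m - i)          # 0.05/10, 0.05/9, ..., 0.05/1
        reject = reject and (p < threshold)  # once one test fails, all remaining fail
        results.append((reviewer, p, threshold, reject))
    return results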

Testing Reviewer Bias - ELA

A technical analysis was performed to check for potential reviewer bias, that is, a tendency for a reviewer to systematically over- or under-rate the materials reviewed. The results show no evidence of reviewer bias in the data. Figure 32 shows the mean score given by each reviewer, sorted in increasing order, with a 95% confidence interval for each reviewer's mean score. There are slight differences between reviewer means, but these may simply be due to chance: some reviewers may have been assigned stronger materials, while others may have reviewed weaker ones.


Figure 32. Mean score by reviewer with a 95% confidence interval.

A t-test was performed to test whether any reviewer had a tendency to over- or under-rate. The t-test compared each reviewer’s average score to the entire sample to test whether the reviewer tended to score away from the overall mean. The results are shown in Table 2. Since the test is performed for 10 reviewers, it is important to adjust for multiple comparisons to avoid declaring a difference significant that could easily arise by chance when drawing 10 means from the same distribution. The table gives the adjusted significance level, calculated using a step-down Bonferroni (Holm) adjustment, in which the ordered p-values are compared to the nominal significance level (0.05) divided by the number of tests remaining; as soon as one test is deemed not significant, all remaining tests are as well. In this case, even the smallest p-value does not fall below its corresponding adjusted significance level of 0.05/10 = 0.005, so there is no evidence of reviewer bias.

Table 2. t-test results for reviewer bias
Reviewer   p-value   Adjusted significance level
12         0.0877    0.0050
16         0.2590    0.0056
14         0.3646    0.0062
15         0.3941    0.0071
19         0.6462    0.0083
17         0.6771    0.0100
13         0.7835    0.0125
11         0.8015    0.0167
18         0.8310    0.0250
20         0.9726    0.0500