Data Analysis

Exploratory Data Analysis

Four instruments were used to evaluate the reviewed OER: Publishers’ Criteria, EQuIP, Achieve OER (selected scales), and Reviewer Comments. A fifth instrument, the CCSS Worksheet, provided foundational information for the other four but was not reported. Each instrument had one or more scales, each composed of one or more items. For example, the Publishers’ Criteria rubric for English Language Arts had three scales: Quality of Text, Quality of Questions, and Writing. The Quality of Text scale had three questions. Data was aggregated at the scale level, as illustrated in the sketch below.
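The following is a minimal sketch of scale-level aggregation, assuming a hypothetical long-format table with one row per reviewer, unit, scale, and item; the column names, values, and use of an average are illustrative, not the actual OSPI data layout.

import pandas as pd

# Hypothetical long-format ratings: one row per reviewer x unit x scale x item.
ratings = pd.DataFrame({
    "unit":     ["Unit A"] * 6,
    "reviewer": [1, 1, 1, 2, 2, 2],
    "scale":    ["Quality of Text"] * 6,
    "item":     [1, 2, 3, 1, 2, 3],
    "score":    [3, 2, 3, 2, 2, 3],
})

# Aggregate item scores up to the scale level (here, by averaging)
# for each reviewer and unit.
scale_scores = (ratings
                .groupby(["unit", "reviewer", "scale"], as_index=False)["score"]
                .mean())
print(scale_scores)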

The Likert scales on the rubrics were converted to ordinal values, as shown below.

Achieve OER Ratings
Rating           Ordinal value
Superior         3
Strong           2
Limited          1
Very Weak        0
Not Applicable   - (not included in analysis)

For the Achieve OER rubric, the Not Applicable ratings were removed from the data analysis. There were 14 instances in mathematics where reviewers selected Not Applicable, almost all in the Quality of Interactivity scale. In the ELA data, the entire Quality of Interactivity scale was removed because there was virtually no interactivity in the ELA units: while digital in format, the units were primarily static PDFs and web content. The two remaining instances in ELA where reviewers selected Not Applicable were removed from the data analysis.
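A minimal sketch of this conversion, assuming the raw Achieve OER ratings arrive as text labels in a pandas column; the scale names and data layout are placeholders. Not Applicable is left out of the lookup so those rows drop out rather than being coded.

import pandas as pd

# Ordinal coding for the Achieve OER rubric; Not Applicable is excluded.
ACHIEVE_CODES = {"Superior": 3, "Strong": 2, "Limited": 1, "Very Weak": 0}

# Hypothetical raw responses as text labels.
raw = pd.DataFrame({
    "scale":  ["Quality of Explanation", "Quality of Interactivity"],
    "rating": ["Strong", "Not Applicable"],
})

# Map labels to ordinal values; unmapped labels (Not Applicable) become NaN
# and are then removed from the analysis.
raw["score"] = raw["rating"].map(ACHIEVE_CODES)
coded = raw.dropna(subset=["score"])
print(coded)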

Tri-State EQuIP Overall Ratings
Rating                 Ordinal value
Exemplar               3
Exemplar if Improved   2
Revision Needed        1
Not Recommended        0
Not Ready for Review   0

For the EQuIP Overall ratings, both Not Recommended and Not Ready for Review were coded as 0. One challenge of converting Likert-type responses to ordinal data is that the written values do not always have equivalent “distances” between steps. Not Recommended and Not Ready for Review are essentially equivalent in this respect: a reasonable person would not find a distinct advantage in one rating over the other. The remaining steps on this scale show positive progression and are coded 1, 2, and 3, respectively.

EQuIP Scale Ratings
Rating                     Ordinal value
Most to all criteria met   3
Many criteria met          2
Some criteria met          1
Does not meet criteria     0

Publishers’ Criteria Ratings
Rating              Ordinal value
Strongly agree      3
Agree               2
Disagree            1
Strongly disagree   0

Reviewer Comments: Amount of Work Required Ratings
Rating     Ordinal value
Extreme    0
Moderate   1
Minor      2
None       3

Reviewer Comments: Use this Material in my Classroom Ratings
Rating              Ordinal value
Strongly agree      3
Agree               2
Disagree            1
Strongly disagree   0

Data was collected using PDF forms that were submitted electronically to OSPI staff. The results were compiled into data sets, which were then cleaned to use consistent references for unit titles, developers, and other metadata. Data was recoded using the conversion tables shown above. Note that while some binary data (worksheet check marks) was collected to help reviewers assess the scored items, none of the check-mark data was included in the analysis.
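A hedged sketch of the recoding step, assuming each response is stored as a text label with a column identifying its rubric; the rubric labels and data layout below are shorthand assumptions, and the dictionaries simply restate the conversion tables above.

import pandas as pd

# Conversion tables restated as lookup dictionaries, one per rubric or scale.
CODES = {
    "EQuIP Overall": {"Exemplar": 3, "Exemplar if Improved": 2,
                      "Revision Needed": 1, "Not Recommended": 0,
                      "Not Ready for Review": 0},
    "EQuIP Scale": {"Most to all criteria met": 3, "Many criteria met": 2,
                    "Some criteria met": 1, "Does not meet criteria": 0},
    "Publishers' Criteria": {"Strongly agree": 3, "Agree": 2,
                             "Disagree": 1, "Strongly disagree": 0},
    "Amount of Work Required": {"None": 3, "Minor": 2,
                                "Moderate": 1, "Extreme": 0},
    "Use this Material in my Classroom": {"Strongly agree": 3, "Agree": 2,
                                          "Disagree": 1, "Strongly disagree": 0},
}

def recode(row):
    """Look up the ordinal value for one response; NaN if the label is unmapped."""
    return CODES.get(row["rubric"], {}).get(row["rating"], float("nan"))

# Hypothetical cleaned submissions: one row per reviewer response.
responses = pd.DataFrame({
    "rubric": ["EQuIP Overall", "Publishers' Criteria"],
    "rating": ["Not Ready for Review", "Agree"],
})
responses["score"] = responses.apply(recode, axis=1)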

The scope of the data analysis did not involve comparing instructional materials to each other using a combination of all scores and all rubrics. Rather, data was compiled into charts for each unit or course with some limited comparisons between the resources based upon individual items or scales.

An independent review of the data was conducted post-hoc to ensure that the data cleaning and organization steps did not introduce errors. Approximately 10% of the data was selected from the raw submitted files and compared to the final consolidated data set. No errors were detected.
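One way such a spot check can be reproduced, assuming the raw submissions and the consolidated data set share a common key (reviewer, unit, rubric, item); the file names, key columns, and score column are placeholders, not the actual OSPI files.

import pandas as pd

KEY = ["reviewer", "unit", "rubric", "item"]   # hypothetical join key

raw = pd.read_csv("raw_submissions.csv")       # placeholder file names
final = pd.read_csv("consolidated_scores.csv")

# Draw roughly 10% of the raw records and compare them to the final data set.
sample = raw.sample(frac=0.10, random_state=1)
merged = sample.merge(final, on=KEY, suffixes=("_raw", "_final"))
mismatches = merged[merged["score_raw"] != merged["score_final"]]
print(f"{len(mismatches)} discrepancies in {len(sample)} sampled records")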

Inter-rater reliability was addressed throughout the data collection process. The reviewers received ongoing training and guidance on standardizing their answers based upon evidence in the text and the detailed instructions found within each of the rubrics. When all the data for a particular unit or course had been submitted, a quick analysis of the individual ratings for each rubric was performed. Where there was a difference of more than two points on an individual item, the reviewers who rated that product were given the opportunity to discuss their conclusions and make adjustments as necessary. They were also told clearly that they could retain their existing scores if they wished.
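A sketch of the flagging rule, assuming scores are stored one row per reviewer per item: an item is flagged when the spread between its highest and lowest rating exceeds two points. Column names are placeholders.

import pandas as pd

def flag_discrepancies(scores: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
    """Return items whose reviewer ratings differ by more than `threshold` points."""
    spread = (scores
              .groupby(["unit", "rubric", "item"])["score"]
              .agg(lambda s: s.max() - s.min())
              .rename("range")
              .reset_index())
    return spread[spread["range"] > threshold]

# Hypothetical example: three reviewers rate the same item 0, 1, and 3.
example = pd.DataFrame({
    "unit": ["Unit A"] * 3, "rubric": ["EQuIP"] * 3, "item": [1, 1, 1],
    "reviewer": [1, 2, 3], "score": [0, 1, 3],
})
print(flag_discrepancies(example))   # spread of 3 points exceeds the threshold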

Mathematics

There were seven full mathematics courses reviewed. Six were Algebra 1 and one was Integrated Math 1. Each course was randomly assigned to five independent reviewers.

Four of the seven curricula had very low variation in responses for all scored elements. The remaining three courses had minor variation on one to three items, with Curriki having the highest variation: three of its thirty items showed a difference of more than two points. All the high-variance items were in the EQuIP rubric. In total, 5 of 210 (2.4%) sets of responses for individual items had a difference of more than two points in the reviewer responses.

ELA

There were 20 ELA units reviewed. Each unit was randomly assigned to four independent reviewers.

Eleven of the twenty ELA units had very low variation in responses for all scored elements. The remaining nine units each had one to three items with a variation of more than two points. One scale in particular had high variance: Quality of Interactivity on the Achieve OER rubric, where 5 of the 20 products showed high variation. This can be explained by a lack of clarity regarding what constitutes interactivity. While the Achieve OER rubric carefully described what would and would not be considered interactive, training and subsequent follow-up did not sufficiently reinforce this direction. For example, opening a PDF or content web page is not considered interactive, but viewing a video or adjusting dynamic values in a table is. ELA had so little interactivity overall that this scale was dropped from the analysis.

Testing Reviewer Bias

We assessed the scores given by each reviewer to look for evidence of reviewer bias.

Figure 28. ELA reviewer average scores with a 95% confidence interval.

Figure 28 shows the average score given by each ELA reviewer, sorted in increasing order, with a 95% confidence interval for the reviewer’s mean score. There do appear to be some differences between reviewers, but this may simply be due to chance: some reviewers would have been assigned better OERs, while others may have reviewed only poor OERs. Similarly, Figure 29 shows the average score given by each mathematics reviewer.

Figure 29. Math reviewer average scores with a 95% confidence interval.
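A sketch of how figures like these can be produced, assuming one scored response per row; the reviewer IDs and scores below are simulated placeholders, and the t-based 95% interval shown may differ slightly from the method used for Figures 28 and 29.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical scored responses: one row per reviewer per rated item.
rng = np.random.default_rng(0)
scores = pd.DataFrame({"reviewer": np.repeat([101, 102, 103, 104], 30),
                       "score": rng.integers(0, 4, 120)})

# Mean, standard error, and a t-based 95% confidence interval per reviewer.
summary = scores.groupby("reviewer")["score"].agg(["mean", "sem", "count"])
summary["ci"] = summary["sem"] * stats.t.ppf(0.975, summary["count"] - 1)
summary = summary.sort_values("mean")

plt.errorbar(range(len(summary)), summary["mean"], yerr=summary["ci"], fmt="o")
plt.xticks(range(len(summary)), summary.index)
plt.xlabel("Reviewer")
plt.ylabel("Mean score (95% CI)")
plt.show()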

To test whether any reviewer had a tendency to over- or under-rate, we calculated a standardized score for each reviewer within each reviewed text and performed a t-test comparing each reviewer’s average standardized score to 0, testing whether the reviewer tended to score away from the mean. The results are shown in Table 1 and Table 2.
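A minimal sketch of that test, assuming the standardization is a z-score computed within each reviewed unit relative to all of its reviewers; the column names are placeholders, and the exact standardization used in the study may differ.

import pandas as pd
from scipy import stats

def reviewer_bias_pvalues(scores: pd.DataFrame) -> pd.Series:
    """One-sample t-test per reviewer on scores standardized within each unit."""
    # Standardize each response relative to all reviewers of the same unit.
    z = scores.groupby("unit")["score"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=1))
    scored = scores.assign(z=z).dropna(subset=["z"])
    # Test whether each reviewer's mean standardized score differs from zero.
    return (scored.groupby("reviewer")["z"]
                  .apply(lambda s: stats.ttest_1samp(s, 0.0).pvalue))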

Since tests were performed for many reviewers, it was important to adjust for multiple comparisons to avoid declaring a difference significant when it could have arisen by chance from drawing ten means from the same distribution. Each table below gives the adjusted significance level, calculated using the Holm-Bonferroni (step-down) method, in which the ordered p-values are compared to the nominal significance level (0.05) divided by the number of tests remaining. As soon as one test is deemed non-significant, the rest are as well. In both cases, even the smallest p-values for mathematics and ELA do not fall below the first adjusted significance level, 0.05/10 = 0.0050, so we conclude that there is no evidence of reviewer bias in either review. Within the tables, the results are presented in the order tested, sorted from smallest to largest p-value.
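The step-down comparison described above can be reproduced as follows (or with statsmodels’ multipletests using method="holm"); the p-values here are the mathematics values from Table 1.

# Step-down comparison: sort p-values ascending and compare each to 0.05
# divided by the number of tests remaining at that step.
pvalues = [0.2425, 0.4181, 0.4247, 0.5360, 0.6665,
           0.7124, 0.7424, 0.8369, 0.9084, 0.9434]   # Table 1 (mathematics)

alpha = 0.05
m = len(pvalues)
for i, p in enumerate(sorted(pvalues)):
    threshold = alpha / (m - i)        # 0.0050, 0.0056, ..., 0.0500
    if p >= threshold:
        # Once one test fails, all remaining tests are also non-significant.
        print(f"Stop: p = {p:.4f} >= {threshold:.4f}; no significant reviewer effects")
        break
    print(f"p = {p:.4f} < {threshold:.4f}: significant")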

Table 1. P-values and adjusted significance levels for mathematics reviewers. Note that even the smallest p-value of 0.2425 does not fall below the adjusted significance level of 0.0050, which allows us to conclude that there is no evidence of reviewer bias in mathematics.

Reviewer   p-value   Adjusted significance level
3          0.2425    0.0050
7          0.4181    0.0056
10         0.4247    0.0062
5          0.5360    0.0071
2          0.6665    0.0083
8          0.7124    0.0100
1          0.7424    0.0125
9          0.8369    0.0167
4          0.9084    0.0250
6          0.9434    0.0500

Table 2. P-values and adjusted significance levels for ELA reviewers. In this instance, the smallest p-value of 0.0694 does not fall below the adjusted significance level of 0.0050, which allows us to conclude that there is no evidence of reviewer bias in ELA.

Reviewer   p-value   Adjusted significance level
109        0.0694    0.0050
104        0.1925    0.0056
102        0.2953    0.0062
110        0.2975    0.0071
105        0.5617    0.0083
106        0.8631    0.0100
107        0.8928    0.0125
108        0.8980    0.0167
101        0.9848    0.0250
103        0.9995    0.0500