The data presented in this repository were collected as part of a research project entitled Exploring Writing Achievement and Its Role in Success at 4-Year Postsecondary Institutions. This project was funded by the Institute of Education Sciences, U.S. Department of Education, Award Number R305A160115, and led by Dr. Jill Burstein (Principal Investigator) and Dr. Daniel McCaffrey (co-Principal Investigator).
The repository contains the following main set of files (a short Python sketch for loading them follows the list):

- `writing-samples/*.txt`: De-identified authentic university coursework writing data. There are 997 files in this directory, each representing one coursework assignment. 735 students participated in this study; a partially overlapping subset of students (N=418) submitted multiple coursework writing assignments. Participants were enrolled at 4-year universities, and writing assignments were collected from courses primarily targeting first-year students. We refer to these files as the University Coursework Writing Corpus. All assignment files are plain text, converted to UTF-8 encoding where necessary.
- `student_data.csv`: Data collected from student participants. Two types of data were collected:
  - Writing Attitudes Survey Data: This survey measured four components of writing attitudes and beliefs: (1) Goals for Writing, (2) Confidence about Writing Tasks, (3) Beliefs about Writing, and (4) Feelings about Writing. Note that the order of the survey questions in this CSV file differs from their original order in the survey. The survey was completed by 566 of the 735 study participants; for the remaining participants, these columns are blank.
  - Outcomes/Success Predictor Measures: A subset of outcomes/success predictor measures for the study participants, including: (1) their course grade (for the course in which writing assignments and surveys were submitted), (2) their study semester GPA, (3) their semester GPA for up to five semesters following study enrollment, and (4) their SAT Total/ACT Composite score (recoded as `SAT Total score`).

  In total, there are 735 rows and 64 columns in this CSV file. A detailed description of each column can be found in `docs/student_data_columns.csv`.
- `writing_features.csv`: Various features based on the writing samples in the writing corpus. The following types of data are included:
  - Assignment Preparation Survey Data: Survey responses (N=929) collected from students about each coursework assignment they submitted.
  - Genre Annotations: Human annotations for each assignment pertaining to assignment type, source requirements, source use, writing aim, and assignment version.
  - Automated Writing Evaluation (AWE) Features: Feature values generated for each of the 997 writing assignments by two different AWE systems: e-rater (Attali & Burstein, 2006) and Writing Mentor (Burstein et al., 2018). The features capture, among other things, grammar and mechanics errors, use of figurative and argumentative language, and vocabulary.

  In total, there are 997 rows and 119 columns in this CSV file. A detailed description of each column can be found in `docs/writing_features_columns.csv`.
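To make the layout above concrete, here is a minimal Python sketch for loading all three data sources. It assumes the `pandas` library is installed and the script runs from the repository root; the `student_id` join key in the final comment is hypothetical, so consult the column descriptions under `docs/` for the actual identifier columns.

```python
from pathlib import Path

import pandas as pd

# Load the 997 plain-text writing samples, keyed by filename.
corpus = {
    path.name: path.read_text(encoding="utf-8")
    for path in sorted(Path("writing-samples").glob("*.txt"))
}

# One row per student participant (735 x 64) and one row per
# writing assignment (997 x 119).
students = pd.read_csv("student_data.csv")
features = pd.read_csv("writing_features.csv")

print(len(corpus), students.shape, features.shape)

# To attach student-level outcomes to assignment-level features, join the two
# tables on the participant identifier. "student_id" is a hypothetical column
# name; see docs/student_data_columns.csv and docs/writing_features_columns.csv
# for the actual identifiers.
# merged = features.merge(students, on="student_id", how="left")
```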
The following additional files can be found under `docs/`:
- `student_data_columns.csv`: A CSV file describing in detail each of the columns found in `student_data.csv`.
- `writing_features_columns.csv`: A CSV file describing in detail each of the columns found in `writing_features.csv`.
- `surveys/assignment_preparation_survey.pdf`: The instrument used for the assignment preparation survey.
- `surveys/writing_attitudes_survey.pdf`: The instrument used for the writing attitudes survey. It is adapted from an instrument developed by MacArthur, Philippakos, & Graham (2016) to measure motivation among college writers.
- `forms/*.pdf`: Blank copies of the student consent forms issued to the student participants.
- `processes/deidentification_procedures.pdf`: A description of the steps we followed to de-identify (remove any personally identifying information from) the writing samples and student metadata.
- `processes/genre_annotation.pdf`: A description of the genre annotation performed to classify writing assignments into broad assignment types.
- `processes/persuasive_subgenre_annotation.pdf`: A description of the subgenre annotation performed to classify persuasive writing assignments into one of six finer-grained categories based on argument value (on a continuum of low to high), source use and integration, and support.
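The column-documentation files can themselves be loaded programmatically. A short sketch follows; the documentation fields are not specified here, so it only prints whatever is there.

```python
import pandas as pd

# Each file is assumed to document one data-file column per row.
for doc in ("docs/student_data_columns.csv", "docs/writing_features_columns.csv"):
    described = pd.read_csv(doc)
    print(f"{doc}: {len(described)} documented columns")
    print(described.head())
```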
References and related publications:

- Attali, Y., & Burstein, J. (2006). Automated Essay Scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3).
- Burstein, J., McCaffrey, D., Beigman Klebanov, B., Ling, G., & Holtzman, S. (2019). Exploring Writing Analytics and Postsecondary Success Indicators. In Companion Proceedings of the 9th International Conference on Learning Analytics & Knowledge (LAK19), Tempe, AZ.
- Burstein, J. (2019, February 12). Writing Analytics: Automated Writing Evaluation Feedback to Support Learning. Invited talk, Learning Analytics Research Network (LEARN) Seminar Series, New York University.
- Burstein, J. (2018, November 7). Natural Language Processing for Education: Applications for Reading and Writing Proficiency. Invited talk, The Seventh Workshop on NLP for Computer-Assisted Language Learning (NLP4CALL), University of Stockholm, Stockholm, Sweden.
- Burstein, J., McCaffrey, D., Beigman Klebanov, B., & Ling, G. (2017). Exploring Relationships between Writing and Broader Outcomes with Automated Writing Evaluation. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), EMNLP 2017, Copenhagen, Denmark.
- Burstein, J., Elliot, N., Beigman Klebanov, B., Madnani, N., Napolitano, D., Schwartz, M., Houghton, P., & Molloy, H. (2018). Writing Mentor: Writing Progress Using Self-Regulated Writing Support. Journal of Writing Analytics, 2, 285-313.
- MacArthur, C. A., Philippakos, Z. A., & Graham, S. (2016). A Multi-component Measure of Writing Motivation with Basic College Writers. Learning Disability Quarterly, 39, 31-43. doi:10.1177/0731948715583115.
This data is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you have any questions about the data, please send us an email.
The data provided in this repository were collected via research supported by the Educational Testing Service and the Institute of Education Sciences, U.S. Department of Education, Award Number R305A160115.
Thanks to Michael Flor, Binod Gyawali, Ben Leong, and Maxwell Schwartz for engineering support. Many thanks to our research assistants, Patrick Houghton, Hillary Molloy, and Zydrune Mladineo, for managing a complex data collection.