Scientific Abstract
Background: This study aims to identify a cohort of patients diagnosed with both a psychotic disorder and posttraumatic stress disorder (PTSD), to develop clinically informed guidelines for annotating their health records for instances of traumatic events and thereby create a publicly available gold-standard dataset, and to demonstrate that data gathered with this annotation scheme are suitable for training a machine learning (ML) model to identify these indicators of trauma in novel (unseen) health records.
Methods: We created a representative corpus of 101 electronic health records (EHRs) of patients diagnosed with a psychotic disorder and PTSD, drawn from a centralized database. We also developed a detailed scheme for annotating information relevant to traumatic events in the clinical narratives of psychiatric EHRs. Clinical experts annotated the dataset and iteratively updated the annotation guidelines in collaboration with a group of computational linguistics specialists. Following the annotation phase, the dataset was adjudicated to resolve disagreements. Baseline span labeling and relation extraction ML models were then developed.
Results: Inter-annotator agreement was high (0.688 for span tags, 0.589 for relations, and 0.874 for tag attributes). The baseline models also performed well (0.571 for span labeling and 0.738 for relation extraction). We described the major points and guiding principles of the annotation process and guideline creation and discussed their justifications. We verified that the annotation schema is learnable by the baseline ML models.
Conclusions: The performance of the baseline models demonstrates the practical viability of the gold-standard TEPcorpus for ML applications.