### Consent waiver

This study was reviewed and approved by the Johns Hopkins University School of Medicine Institutional Review Board and adhered to the tenets of the Declaration of Helsinki. The requirement for informed consent was waived because of the retrospective nature of the study.

### Data collection

Demographic and clinical data were obtained from patients seen at the Johns Hopkins Wilmer Eye Institute from June 1990 to June 2020. The clinical assessment of worsening at the last visual field (VF) was extracted from Epic (Verona, Wisconsin). Eyes rated by the clinician as possibly or likely worsening on VF testing were labeled as worsening, while all other ratings (stable, possibly improving, or likely improving) were labeled as not worsening. The VF data were HVF 24-2 tests extracted from FORUM (Zeiss, Dublin, CA). Most of these were SITA-Standard, but SITA-Fast, full-threshold, and SITA-Faster tests were also included.

VFs were included only if they were considered reliable, defined as less than 15% false positives and less than 25% false negatives for mild/moderate disease or less than 50% false negatives for severe disease.^{36} We only included eyes with at least 7 reliable VFs so that an accurate determination of longitudinal change could be made. The last VF in the series for each eye was required to have a clinician assessment of VF worsening or not worsening recorded in the chart. The number of VF tests excluded at each step is shown in the flow chart (Fig. 1).
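The inclusion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `VisualField` record and its field names are hypothetical, and the `is_severe` flag is assumed to come from a separate staging step, since the text does not state how severity was determined for the false-negative cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_RELIABLE_VFS = 7  # minimum number of reliable fields per eye (from the text)


@dataclass
class VisualField:
    # Hypothetical record layout, for illustration only.
    false_pos_pct: float
    false_neg_pct: float
    is_severe: bool  # assumed to be supplied by a separate staging step


def is_reliable(vf: VisualField) -> bool:
    """Apply the false-positive and severity-dependent false-negative cutoffs."""
    fn_limit = 50.0 if vf.is_severe else 25.0
    return vf.false_pos_pct < 15.0 and vf.false_neg_pct < fn_limit


def reliable_series(series: List[VisualField]) -> Optional[List[VisualField]]:
    """Return the reliable fields, or None if fewer than seven remain."""
    kept = [vf for vf in series if is_reliable(vf)]
    return kept if len(kept) >= MIN_RELIABLE_VFS else None
```

Returning `None` for an eye with too few reliable fields keeps the exclusion explicit, which makes it easy to count exclusions at each step.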

### Methods to determine visual field worsening

There is no gold standard for assessing VF worsening, but numerous algorithms are commonly employed in the field. We used six of these automated methods. Three were event-based: Guided Progression Analysis (GPA), the Advanced Glaucoma Intervention Study (AGIS) scoring system, and the Collaborative Initial Glaucoma Treatment Study (CIGTS) scoring system. Three were trend-based: mean deviation (MD) rate of change (MD slope), VF index (VFI) rate of change (VFI slope), and pointwise linear regression (PLR). In addition to these algorithms, we had access to the clinician assessment of worsening for the last VF in each series. Each method is described below. All event-based methods required a baseline, which was calculated as the average of the first two VFs.

GPA is typically calculated by proprietary software and is based on the Glaucoma Change Probability Analysis.^{3,21,37} Deviation values at each point in the VF are compared to the average of the values from the first two VFs. Points with a difference significantly greater than the test–retest variability at p < 0.05 are identified. As we did not have access to the GPA database of test–retest variability thresholds, we determined thresholds for α < 0.05 from an empirical normative database from the University of Iowa. We also used total deviation values instead of the pattern deviation values classically used by GPA, as previous studies have shown that total deviation is more likely to detect progression.^{38} We defined worsening as three or more points worsening beyond the threshold level on three consecutive fields compared to the average of the first two VF exams.

The AGIS score was calculated for each VF as described in the AGIS trial.^{13} Briefly, each VF is graded based on the depth and number of defects in pre-specified locations on the VF. These pre-specified locations include the nasal, superior, and inferior hemifields. The score ranges from 0 to 20, and scores for each VF are compared to the baseline score. A computer program was used to calculate the score.^{39} An AGIS score increase of at least four points that was sustained on three consecutive VFs was classified as worsening.

The CIGTS score calculation has been previously described in the CIGTS trial.^{15} This score uses the total deviation probability map and is calculated based on the density and depth of defects across the VF. A VF with multiple isolated defective points receives a lower score than one in which the defective points are clustered. The CIGTS score also ranges from 0 to 20, and an increase of three or more points that was sustained on three consecutive VFs was classified as worsening.
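The confirmation rule shared by the AGIS and CIGTS criteria can be sketched as follows. This assumes the per-visit scores have already been computed and that the three confirming fields may fall anywhere in follow-up; the text does not say whether they must include the final VF. The function name and parameters are illustrative.

```python
def sustained_increase(scores, min_increase, n_consecutive=3):
    """Return True if the score exceeds baseline by at least min_increase
    on n_consecutive successive follow-up fields.

    scores: per-visit scores in chronological order (e.g. AGIS or CIGTS).
    The baseline is the mean of the first two scores, as stated in the text.
    """
    if len(scores) < 2 + n_consecutive:
        return False
    baseline = (scores[0] + scores[1]) / 2
    run = 0  # length of the current streak of confirming fields
    for score in scores[2:]:
        run = run + 1 if score - baseline >= min_increase else 0
        if run >= n_consecutive:
            return True
    return False
```

Under these assumptions, `sustained_increase(agis_scores, 4)` would apply the AGIS criterion and `sustained_increase(cigts_scores, 3)` the CIGTS criterion.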

The MD slope was calculated as the simple linear regression of the MD values for the VFs. VF worsening was defined as a negative slope ≤ −0.5 dB/year with a regression p-value less than 0.05. Similarly, the VFI slope was calculated as the linear regression of the VFI values. VF worsening was defined as a negative slope ≤ −1.8%/year with a p-value of less than 0.05.^{21}
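A trend-based criterion of this kind can be sketched with the standard library alone. The sketch below fits an ordinary least-squares line, tests the slope with a two-sided t-test on n − 2 degrees of freedom, and applies a slope cutoff. The function names are illustrative, regressing against time in years is an assumption about the time axis, and the p-value is obtained by numerical integration rather than a statistics package.

```python
import math


def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided Student's t p-value.

    Substituting x = sqrt(df) * tan(theta) turns the tail probability into a
    bounded integral of cos(theta) ** (df - 1), evaluated by Simpson's rule.
    """
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(math.pi)
    lo = math.atan(abs(t_stat) / math.sqrt(df))
    h = (math.pi / 2 - lo) / steps
    total = math.cos(lo) ** (df - 1) + math.cos(math.pi / 2) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(lo + i * h) ** (df - 1)
    return min(1.0, 2 * const * total * h / 3)


def slope_test(years, values):
    """Ordinary least-squares slope and its two-sided p-value."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    if sse == 0.0:  # perfect fit: treat any nonzero slope as significant
        return slope, (0.0 if slope != 0 else 1.0)
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, t_two_sided_p(slope / std_err, n - 2)


def is_worsening(years, values, slope_cutoff, alpha=0.05):
    """True if the slope is at or below the cutoff and p < alpha."""
    slope, p_value = slope_test(years, values)
    return slope <= slope_cutoff and p_value < alpha
```

With the thresholds from the text, `is_worsening(years, md_values, -0.5)` would correspond to the MD slope criterion and `is_worsening(years, vfi_values, -1.8)` to the VFI slope criterion.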

For PLR, linear regression was performed for the total deviation values of each of the 52 VF points separately. VF worsening was defined as the presence of any three points with a negative slope ≤ −1 dB/year with a p-value ≤ 0.01.^{21}

Clinician assessment of worsening was determined for each eye by the clinician at the time of the last visual field and recorded in Epic. The clinician could choose from checkboxes that denoted likely worsening, possible worsening, stable, possible improvement, or likely improvement. A judgment of likely or possible worsening was classified as worsening, while all other choices were classified as not worsening.

### Reference standards

A reference standard for VF worsening was defined as at least four out of six algorithms (GPA, AGIS, CIGTS, MD slope, VFI slope, and PLR) identifying worsening. This was used as the label for worsening to train and test the deep learning model (DLM) and serves as the ground truth for VF worsening in this study. It was also used as the reference for the receiver-operating characteristic (ROC) curve in Fig. 4. A supplementary analysis was conducted with the clinician assessment of worsening used as the reference standard for training the DLM and generating the ROC curve (Supplementary Fig. 2).
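The consensus label amounts to a simple vote count. A minimal sketch, assuming each algorithm's binary call is already available in a dictionary keyed by algorithm name (the key names are illustrative):

```python
ALGORITHMS = ("GPA", "AGIS", "CIGTS", "MD slope", "VFI slope", "PLR")


def reference_label(calls, min_votes=4):
    """Return True when at least min_votes of the six algorithms flag worsening.

    calls: dict mapping each algorithm name to a boolean worsening call.
    """
    missing = [name for name in ALGORITHMS if name not in calls]
    if missing:
        raise ValueError(f"missing algorithm results: {missing}")
    return sum(bool(calls[name]) for name in ALGORITHMS) >= min_votes
```

Raising on a missing algorithm, rather than counting it as a negative vote, avoids silently biasing the label toward "not worsening".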

### Deep learning architecture

The DLM architecture is described in Fig. 1. The input to the network consists of two parts: (1) a set of 7 or more VF images, each containing 54 points that were radially blurred onto a 12 × 12 grid, stacked together; and (2) a stack of 7 or more sets of 8 global metrics from each VF (age, VFI in %, PSD in dB, MD in dB, false negatives in %, false positives in %, test duration in seconds, and fixation losses). The DLM architecture can receive unevenly spaced temporal data from each VF series. The dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively. The data were split at the patient level, so if both eyes of a patient were included, they fell within the same set. Including only one eye from each patient did not change the results of the study. The data were randomly distributed so that all datasets (training, validation, and testing) contained both eyes that were and eyes that were not determined to be worsening. For the deep learning architecture, we implemented a single 2D convolutional LSTM with a 3 × 3 kernel size. Batch normalization was also integrated into the model to reduce internal covariate shift. The output of the model was the probability of VF worsening.
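A patient-level split of this kind can be sketched as below. This is an illustration under stated assumptions, not the study's procedure: the 80/10/10 fractions are applied to patients rather than eyes, the function and argument names are hypothetical, and the sketch does not check that each set contains both worsening and non-worsening eyes.

```python
import random


def split_by_patient(eye_to_patient, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each eye to train/val/test so a patient's eyes stay together.

    eye_to_patient: dict mapping an eye identifier to its patient identifier.
    """
    patients = sorted(set(eye_to_patient.values()))
    random.Random(seed).shuffle(patients)  # seeded for reproducibility
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    group = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            group[patient] = "train"
        elif i < n_train + n_val:
            group[patient] = "val"
        else:
            group[patient] = "test"
    return {eye: group[patient] for eye, patient in eye_to_patient.items()}
```

Splitting on the patient identifier prevents the two eyes of one patient, which are likely correlated, from appearing on both sides of the train/test boundary.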

An additional analysis was carried out by removing VFs from the end of the series for each eye and re-training the model with fewer data points. This tested the DLM’s ability to judge worsening before it had access to all of the information used by the 4 out of 6 algorithms reference standard. The VFs were removed sequentially from the end (removing the final VF, the final two VFs, the final three VFs, etc.). This was done up to a maximum of removing the final 6 VFs, since all included eyes had at least 7 VFs. Each eye therefore had at least 1 VF entered into the model as input, although about 87% of eyes had more than this minimum number. The label for worsening and the assessment of performance were still based on the original consensus of 4 out of 6 algorithms using all the VFs.
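The truncation scheme can be illustrated with a short generator. It assumes a VF series is held as a chronologically ordered list; the name `truncated_inputs` is hypothetical, and the label attached to each shortened series would remain the consensus computed from the full series.

```python
def truncated_inputs(series, max_removed=6):
    """Yield (n_removed, shortened_series) pairs, dropping VFs from the end."""
    for n_removed in range(1, max_removed + 1):
        if n_removed >= len(series):
            break  # always keep at least one VF as model input
        yield n_removed, series[:-n_removed]
```

For an eye with exactly 7 VFs, this yields inputs of length 6 down to 1.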

### Statistical analysis

Since multiple methods were used to identify VF worsening, we calculated the level of agreement among these methods. Pairwise agreement was assessed using Cohen’s kappa coefficient. Based on previous literature, a kappa coefficient of 0 to 0.2 indicated slight agreement, 0.2 to 0.4 fair agreement, 0.4 to 0.6 moderate agreement, and 0.6 to 0.8 substantial agreement.^{40} Agreement across more than two methods was determined by calculating Fleiss’ kappa coefficient.^{41}
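For reference, Cohen's kappa compares the observed agreement between two methods with the agreement expected by chance from each method's marginal label frequencies. A minimal sketch for two lists of labels over the same eyes (the function name is illustrative):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two methods' labels over the same eyes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from the product of the marginal frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both methods used a single identical category
    return (observed - expected) / (1 - expected)
```

As a worked case, two methods that agree on 8 of 10 eyes, each flagging 4 eyes as worsening, have observed agreement 0.80 and chance agreement 0.52, giving kappa ≈ 0.58, which falls in the moderate band above.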

Another model for identifying worsening was created using a mixed-effects model that was provided with all the same data as the LSTM (Fig. 3), with “Patient ID” and “Eye ID” treated as random effects and all other features treated as fixed effects.

For the deep learning prediction, we constructed a ROC curve, which visualizes the performance of the DLM at all classification thresholds (Fig. 4). An AUC value and its 95% confidence interval were calculated as a measure of prediction performance. The Clopper–Pearson method was used to calculate the 95% confidence intervals of the false positive rates and true positive rates.^{42} The same approach was used to obtain an AUC for the mixed-effects model. For clinician assessment of worsening, a fixed true positive rate and false positive rate were calculated. An exact ROC curve cannot be calculated for clinician assessment of worsening since it is a discrete, binary classification. To evaluate clinician prediction performance, a best minmax AUC score and its upper and lower bounds were calculated, assuming the clinician ROC curve is concave or monotone.^{43}
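The Clopper–Pearson interval for a rate such as the true positive rate can be sketched without a statistics package by inverting the binomial tail probabilities with bisection. This is an illustrative implementation suited to modest sample sizes, not the study's code; the helper names are hypothetical.

```python
import math


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing, tol=1e-10):
    """Solve func(p) == target for p in [0, 1], given that func is monotone."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion k / n."""
    # Lower bound solves P(X >= k | p) = alpha / 2, which rises with p.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - _binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound solves P(X <= k | p) = alpha / 2, which falls with p.
    upper = 1.0 if k == n else _bisect(
        lambda p: _binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper
```

For a true positive rate, `k` would be the number of true positives and `n` the number of eyes labeled worsening by the reference standard; for a false positive rate, `k` would be the false positives and `n` the eyes labeled not worsening.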

Unless specified otherwise, all comparisons and performance analyses were calculated on the test dataset only. The DLM was developed using Python (Python Software Foundation, Wilmington, Delaware). SPSS (IBM Corp, Armonk, NY) was used for statistical comparisons.

### Conference presentations

American Glaucoma Society, Paper Presentation, Nashville, TN, 2022.