**Sample Size in Bioequivalence Cross-Over Trials with Balanced Incomplete Block Design**

Lina Hahn^{1}, Gerhard Nehmiz^{2}, Jan Beyersmann^{1}, Salome Mack^{2}

^{1}Universität Ulm, Institut für Statistik, Germany; ^{2}Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach

In cross-over trials, all subjects receiving the same sequence of treatments form one sequence group, so there are s sequence groups of equal size n/s. If, due to limitations, the number of periods (p) is smaller than the number of treatments (t), we have an Incomplete Block Design. If the allocation of treatments to sequence groups is balanced, it is a Balanced Incomplete Block Design (BIBD).

Necessary conditions are: If r is the number of different sequences each treatment appears in, and if lambda is the number of different sequences in which each treatment pair occurs, balance implies r = p * s / t and lambda = r * (p-1) / (t-1). BIBDs exist for any t and p but can become large (Finney 1963). We investigate two examples with reasonable size, an internal one and the example of Senn (1997).
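The two balance conditions can be checked numerically. The following is a minimal sketch, assuming only the formulas r = p * s / t and lambda = r * (p-1) / (t-1) given above; the function names and the example values of t, p and s are illustrative and are not the designs studied in the abstract.

```python
from fractions import Fraction


def bibd_parameters(t, p, s):
    """Return (r, lam) for t treatments, p periods and s sequences.

    Exact fractions are used so that non-integer results are visible.
    """
    r = Fraction(p * s, t)           # sequences in which each treatment appears
    lam = r * (p - 1) / (t - 1)      # sequences in which each treatment pair occurs
    return r, lam


def satisfies_necessary_conditions(t, p, s):
    """Both r and lambda must be integers for a balanced design to be possible."""
    r, lam = bibd_parameters(t, p, s)
    return r.denominator == 1 and lam.denominator == 1


# Hypothetical example: 3 treatments, 2 periods, 6 sequences.
r, lam = bibd_parameters(t=3, p=2, s=6)
print(r, lam)
print(satisfies_necessary_conditions(t=4, p=3, s=2))
```

These conditions are necessary only: integer values of r and lambda do not by themselves guarantee that a design exists.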

Furthermore, period effects are likely in medical trials, so in the BIBD the allocation of treatments to periods must also be balanced for treatment contrasts to be estimated without bias. It is sufficient that s is an integer multiple of t, or equivalently that r is an integer multiple of p (Hartley 1953). Our two examples fulfil this.

Let the linear model for the measurements y be as usual, with i.i.d. random subject effects and fixed terms for treatment, period and sequence group (the latter absorbed by the subject effect). All treatment contrasts can then be estimated without bias, and if all error terms are i.i.d. N(0,sigma_e^2) and the subject effects are independent of these, the variance of the contrasts can be estimated as well. While the contrast variance is 2*sigma_e^2 in a complete cross-over, in a BIBD it is generically b_k*sigma_e^2, where b_k is the “design factor”. We obtain b_k = (2*p*s) / (lambda*t).
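As a quick plausibility check of the design factor, the sketch below evaluates b_k = (2*p*s) / (lambda*t) with lambda computed from the balance conditions r = p * s / t and lambda = r * (p-1) / (t-1). The function name and the example designs are hypothetical. For a complete cross-over (p = t) the expression reduces to 2, in line with the contrast variance 2*sigma_e^2.

```python
from fractions import Fraction


def design_factor(t, p, s):
    """Design factor b_k = 2*p*s / (lambda*t); requires p >= 2 so that lambda > 0."""
    r = Fraction(p * s, t)
    lam = r * (p - 1) / (t - 1)
    return Fraction(2 * p * s) / (lam * t)


# Complete cross-over with 3 treatments, 3 periods and 6 sequences.
print(design_factor(t=3, p=3, s=6))
# Incomplete design with 3 treatments but only 2 periods.
print(design_factor(t=3, p=2, s=6))
```

A larger design factor means a larger contrast variance for the same residual variance, which is the price of using fewer periods than treatments.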

Bioequivalence is investigated through the two one-sided tests procedure (TOST) for a treatment contrast (Schuirmann 1987). We investigate the power of the TOST in the two examples, using the t distribution (Shen 2015, Labes 2020), and compare it with a previous normal approximation, which leads to slightly underpowered trials.
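To make the normal approximation concrete, the following sketch computes an approximate TOST power when the contrast estimate is treated as normally distributed with known standard error. It is not the t-based calculation of the abstract. The equivalence margins log(0.8) and log(1.25), the scaling of the variance as b_k*sigma_e^2/n, and all function names are assumptions made for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def contrast_se(b_k, sigma_e, n):
    """Assumed standard error of a contrast: sqrt(b_k * sigma_e^2 / n)."""
    return sqrt(b_k * sigma_e ** 2 / n)


def tost_power_normal(delta, se, lower, upper, alpha=0.05):
    """Approximate power of the TOST under a normal model with known se.

    Both one-sided tests reject when the estimate lies between
    lower + z*se and upper - z*se, where z is the (1 - alpha) quantile.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha)
    power = (std_normal.cdf((upper - delta) / se - z)
             + std_normal.cdf((delta - lower) / se - z) - 1)
    return max(0.0, power)


# Hypothetical inputs: design factor 4, residual SD 0.2 on the log scale, 48 subjects.
se = contrast_se(b_k=4, sigma_e=0.2, n=48)
print(tost_power_normal(delta=0.0, se=se, lower=log(0.8), upper=log(1.25)))
```

Because the residual variance is estimated in practice, the critical values come from a t distribution and are larger than the normal quantile, so this approximation overstates the power, which is consistent with the underpowering mentioned above.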

**Investigating the reporting quality of RCT abstracts on COVID-19 according to CONSORT (CoCo study): interim report of a review**

Sabrina Tulka, Christine Baulig, Stephanie Knippschild

*Lehrstuhl für Medizinische Biometrie und Epidemiologie, Universität Witten/Herdecke, Germany*

Background: In 2020, the urgency of the global COVID-19 crisis accelerated both research activity and peer review. Although full texts are currently freely available, they are not automatically accessible to every reader (for example, when they are not written in English). In addition, time pressure often forces medical staff to rely on abstracts alone for a first overview of a specialised topic. Abstracts therefore play a key role and frequently form the basis for decisions. The CONSORT statement for abstracts gives authors a guideline for ensuring the quality (completeness and transparency) of the reporting of medical research, including in abstracts. The aim of this study was to examine the completeness of the abstracts of all COVID-19 RCTs published so far.

Methods: A literature search in PubMed and Embase identified all publications up to 29 October 2020, which were screened for topic (reporting results of coronavirus studies) and study design (RCT). Eligible publications were then assessed for completeness of information (whether the information could be found at all) and for correctness (whether the information was reported as required by CONSORT for abstracts). The assessment was based on the CONSORT checklist for RCT abstracts, which comprises 16 items. Two raters assessed each abstract independently and then reached consensus. The primary endpoint of the study was the proportion of correctly implemented CONSORT items. As a secondary endpoint, the frequency of correct reporting was examined for each individual item.

Results: Of 88 publications in total, 30 were included in the analysis as reports of an RCT. The abstracts reported a median of 63% of the required criteria (interquartile range: 44% to 88%, minimum: 25%, maximum: 100%). A median of 50% of the criteria were implemented correctly (interquartile range: 31% to 70%, minimum: 12.5%, maximum: 87.5%). The “number of patients analysed” (20%) and “complete results for the primary endpoint” (37%) were reported least often. Information on the intervention was found in 97% of the abstracts but was correct (complete) in only 43%.

Discussion: Half of all abstracts contained at most half of the necessary information. Particularly striking were the incomplete reporting of the final number of patients, of the results, and of the treatments used in each group. Because a fast-moving crisis such as the COVID-19 pandemic leaves little time for a complete and critical review of all full texts (even though they are often available), information in study abstracts must be reported completely and transparently. Our study shows a clear need for action regarding the reporting quality of abstracts and is an appeal to all authors.

**A Scrum related interdisciplinary project between Reproductive Toxicology and Nonclinical Statistics to improve data transfer, statistical strategy and knowledge generation**

Monika Brüning^{1}, Bernd Baier^{2}, Eugen Fischer^{2}, Gaurav Berry^{2}, Bernd-Wolfgang Igl^{1}

^{1}Nonclinical Statistics, Biostatistics and Data Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany; ^{2}Reproductive Toxicology, Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

The development of a new drug is a long journey full of important milestones that have to be reached successfully. After identification of a pharmaceutical candidate substance, nonclinical developmental and reproductive toxicology (DART) studies are one important element to assess the safety of the future drug. Recommendations on study design and conduct are given in ICH Guideline S5 to support human clinical trials and market access of new pharmaceuticals involving various phase-dependent designs incl. a huge number of parameters.

DART studies have to be performed in animal models and aim to detect any effect of the test item on mammalian reproduction relevant for human risk assessment.

In general, reproductive toxicology study data involve complex correlation structures between mother and offspring, e.g. maternal weight development, fetus weight, ossification status and number of littermates, all of which depend on the test item dose. Thus, from a statistical point of view, DART studies are highly demanding and interesting. This complexity is not reflected in the statistical approaches implemented in standard lab software.

To address this, we have set up a Scrum-inspired project to intensify the cooperation between Reproductive Toxicology and Nonclinical Statistics and to work according to agile principles. Within this project we defined, among other things, processes for data transfer and analysis, including a sophisticated, scientifically state-of-the-art statistical methodology.

In this work, we will mainly focus on technical aspects for constructing an Analysis Data Set (ADS) involving regulatory requirements by CDISC SEND (Standard for Exchange of Nonclinical Data), but also sketch new concepts for visualization and statistical analysis.

**Statistical Cure of Cancer in Schleswig-Holstein**

Johann Mattutat^{1}, Nora Eisemann^{2}, Alexander Katalinic^{1,2}

^{1}Institute for Cancer Epidemiology, University of Lübeck, Germany; ^{2}Institute of Social Medicine and Epidemiology, University of Lübeck, Germany

Cancer patients who have survived their treatment and have been released into remission still live with the uncertainty of late recurrences of their disease. Yet studies have shown that, for most cancer entities, the observed mortality in the patient group converges to the overall population mortality after some time. The amount of excess mortality and its time course can be estimated. The time point at which it falls below a defined threshold can be interpreted as “statistical cancer cure”.

This contribution shall focus on the workflow estimating the time point of statistical cancer cure. We briefly explain each step, describe design choices, and report some exemplary results for colorectal cancer. The calculations are based on data provided by the cancer registry of Schleswig-Holstein. First, a threshold for “statistical cancer cure” is defined. Then, missing information on tumor stage at diagnosis is imputed using multiple imputation. The net survival depending on cancer entity, sex, tumor stage at diagnosis (UICC) and age is estimated using a flexible excess hazard regression model implemented in the R package “mexhaz”. The excess mortality is derived using Gauss-Legendre quadrature. Finally, survival rates are estimated conditionally on the survival time since diagnosis and the defined thresholds are applied to get the estimated time point of cure.
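The quadrature and conditioning steps can be illustrated as follows. This is a minimal sketch and not the “mexhaz” implementation: it assumes that an excess hazard function is already available, integrates it with a composite 3-point Gauss-Legendre rule to obtain the cumulative excess hazard, and derives net and conditional net survival from it. All function names and the example hazard are hypothetical.

```python
from math import exp, sqrt

# Nodes and weights of the 3-point Gauss-Legendre rule on [-1, 1].
_NODES = (-sqrt(3 / 5), 0.0, sqrt(3 / 5))
_WEIGHTS = (5 / 9, 8 / 9, 5 / 9)


def cumulative_excess_hazard(hazard, t, panels=10):
    """Integrate hazard(u) over [0, t] with a composite Gauss-Legendre rule."""
    width = t / panels
    half = width / 2
    total = 0.0
    for k in range(panels):
        mid = (k + 0.5) * width  # centre of the k-th panel
        total += half * sum(w * hazard(mid + half * x)
                            for x, w in zip(_NODES, _WEIGHTS))
    return total


def net_survival(hazard, t, panels=10):
    """Net survival S(t) = exp(-cumulative excess hazard up to t)."""
    return exp(-cumulative_excess_hazard(hazard, t, panels))


def conditional_net_survival(hazard, t, horizon, panels=10):
    """Net survival over a further `horizon` years, given survival up to t."""
    return net_survival(hazard, t + horizon, panels) / net_survival(hazard, t, panels)


# Hypothetical excess hazard that decays over time since diagnosis (per year).
decaying_hazard = lambda u: 0.08 * exp(-0.5 * u)
print(net_survival(decaying_hazard, 5.0))
print(conditional_net_survival(decaying_hazard, 3.0, 5.0))
```

With a decaying excess hazard, the conditional net survival approaches 1 as the time already survived grows, which is the behaviour the cure threshold is meant to capture.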

We focus on the probability of belonging to the group that will suffer future excess mortality and define the time point at which this probability falls below 5% as time of “statistical cancer cure”. For the example of colorectal cancer (C18-C21) diagnosed in a local stage (UICC II), this probability amounts to approximately 16% at the time of diagnosis and statistical cure is reached after 4.2 years.
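One way to read the 5% criterion is through a mixture cure formulation, sketched below. This is an assumption for illustration, not the registry analysis: net survival is taken to be S(t) = pi + (1 - pi) * S_u(t), where pi is the cured fraction and S_u the net survival of those who will suffer excess mortality. The exponential form of S_u and its rate are hypothetical, so the resulting time is not meant to reproduce the 4.2 years reported above.

```python
from math import exp


def prob_uncured(t, cure_fraction, surv_uncured):
    """P(belongs to the excess-mortality group | alive at time t)."""
    s_u = surv_uncured(t)
    return (1 - cure_fraction) * s_u / (cure_fraction + (1 - cure_fraction) * s_u)


def time_of_cure(cure_fraction, surv_uncured, threshold=0.05, upper=50.0, tol=1e-8):
    """Smallest t at which prob_uncured drops below the threshold (bisection).

    Assumes prob_uncured is decreasing in t, which holds when surv_uncured is.
    """
    if prob_uncured(0.0, cure_fraction, surv_uncured) < threshold:
        return 0.0
    if prob_uncured(upper, cure_fraction, surv_uncured) >= threshold:
        raise ValueError("threshold not reached before the upper time limit")
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_uncured(mid, cure_fraction, surv_uncured) < threshold:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical inputs: 84% cured at diagnosis, exponential survival of the uncured.
uncured_survival = lambda t: exp(-0.3 * t)
print(prob_uncured(0.0, 0.84, uncured_survival))
print(time_of_cure(0.84, uncured_survival))
```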

Results like the ones described above may support cancer patients by removing uncertainty regarding their future prognosis. As a subsequent step, comprehensive data covering most of the common cancer entities are to be generated.

**Biometrical challenges of the Use Case of the Medical Informatics Initiative (MI-I) on “POLypharmacy, drug interActions, Risks” (POLAR_MI)**

Miriam Kesselmeier^{1}, Martin Boeker^{2}, Julia Gantner^{1}, Markus Löffler^{3}, Frank Meineke^{3}, Thomas Peschel^{3}, Jens Przybilla^{3}, André Scherag^{1}, Susann Schulze^{4}, Judith Schuster^{3}, Samira Zeynalova^{3}, Daniela Zöller^{2}

^{1}Institute of Medical Statistics, Computer and Data Sciences, Jena University Hospital, Jena, Germany; ^{2}Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany; ^{3}Institute for Medical Informatics, Statistics and Epidemiology, University Leipzig, Leipzig, Germany; ^{4}Geschäftsbereich Informationstechnologie, Universitätsklinikum Hamburg-Eppendorf, Hamburg, Germany

Introduction: The aim of POLAR_MI is to use (and, where necessary, adapt/develop) methods and processes of the MI-I to contribute to the detection of health risks in patients with polypharmacy. Polypharmacy occurs especially in elderly patients with multi-morbidity. It is associated with an increased risk for medication errors and drug-drug or drug-disease interactions, which either reduce or intensify the desired effect of individual active substances or lead to undesired adverse drug effects. The project involves an interdisciplinary team ranging from medical informatics to pharmacy and clinical pharmacology with experts from 21 institutions, among them 13 university hospitals and their data integration centres (DICs). Here we focus on some of the biometrical challenges of POLAR_MI.

Material and methods: POLAR_MI relies on the infrastructure of the DICs. The tasks of a DIC include the transfer of data from a wide range of data-providing systems, their interoperable integration and processing while ensuring data quality and data protection. Ultimately, DICs should contribute to a data sharing culture in medicine along the FAIR principles. POLAR_MI is designed to utilize (anonymous) data conforming to the MI-I core data set specification. The generic biometrical concept foresees a two-step procedure: 1) Aggregation (including analysis) of individual patient-level data locally at each DIC using distributed computing mechanisms and shared algorithms, because the security of personal data cannot be guaranteed by anonymisation alone. 2) Afterwards, combination of aggregated data across all contributing DICs. The formulation of these steps requires continuous feedback from the other working groups of POLAR_MI.

Results: To implement the biometrical concept in practice, we have developed multiple small, iterative steps that alternate between the pharma and DIC teams. These steps are addressed by pilot data use projects initially limited to single DICs. The steps cover definitions of potentially drug-related events (like falls, delirium or acute renal insufficiency), active substances, outcomes and related value ranges, and requirements regarding data privacy. Additionally, the handling of missing data, possibly non-ignorable heterogeneity between the DICs, and inclusion and exclusion criteria for the different research questions of POLAR_MI were discussed.

Conclusion: Based on the results of the pilot data use projects, analyses (approaches) of the main hypotheses of POLAR_MI will be developed. The iterative workflow and the necessary steps presented here may serve as a blueprint for other projects using real-world data.

**DNT: An R package for differential network testing, with an application to intensive care medicine**

Roman Schefzik, Leonie Boland, Bianka Hahn, Thomas Kirschning, Holger Lindner, Manfred Thiel, Verena Schneider-Lindner

*Medical Faculty Mannheim, Heidelberg University, Germany*

Statistical network analyses have become popular in many scientific disciplines, where a specific and important task is to test for significant differences between two networks. In our R package DNT, which will be made available at https://github.com/RomanSchefzik/DNT, we implement a general framework for differential network testing procedures that differ with respect to (1) the network estimation method (typically based on specific concepts of association) and (2) the network characteristic employed to measure the difference. Using permutation-based tests with variants for paired and unpaired settings, our approach is general and applicable to various overall, node-specific or edge-specific network difference characteristics. Moreover, tools for visual comparison of two networks are implemented. Along with the package, we provide a corresponding user-friendly R Shiny application.
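The general idea of an unpaired permutation test for an overall network difference can be sketched in a few lines. The code below is not the DNT package and does not reproduce its interface: it assumes Pearson correlation as the network estimation method and the sum of absolute edge-weight differences as the difference characteristic, and all function names and the simulated data are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_network(rows):
    """Edge weights {(i, j): correlation} from rows of observations."""
    cols = list(zip(*rows))
    k = len(cols)
    return {(i, j): pearson(cols[i], cols[j])
            for i in range(k) for j in range(i + 1, k)}


def network_difference(rows_a, rows_b):
    """Overall difference: sum of absolute edge-weight differences."""
    net_a, net_b = correlation_network(rows_a), correlation_network(rows_b)
    return sum(abs(net_a[edge] - net_b[edge]) for edge in net_a)


def permutation_test(rows_a, rows_b, n_perm=199, seed=1):
    """Unpaired permutation test: reshuffle group labels, recompute the statistic."""
    rng = random.Random(seed)
    observed = network_difference(rows_a, rows_b)
    pooled = list(rows_a) + list(rows_b)
    n_a = len(rows_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if network_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Simulated example: two variables, positively related in group A, negatively in B.
data_rng = random.Random(0)
group_a = [[x, x + 0.1 * data_rng.gauss(0, 1)]
           for x in (data_rng.gauss(0, 1) for _ in range(30))]
group_b = [[x, -x + 0.1 * data_rng.gauss(0, 1)]
           for x in (data_rng.gauss(0, 1) for _ in range(30))]
print(permutation_test(group_a, group_b))
```

A paired variant would instead swap the two measurements within each subject, and node-specific or edge-specific characteristics would replace `network_difference` by the corresponding statistic.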

Exemplarily, we demonstrate the usefulness of our package in a novel application to a specific issue in intensive care medicine. In particular, we show that statistical network comparisons based on parameters representing the main organ systems are beneficial for the evaluation of the prognosis of critically ill patients in the intensive care unit (ICU), using patient data from the surgical ICU of the University Medical Centre Mannheim, Germany. We specifically consider both cross-sectional comparisons between a non-survivor and a survivor group (identified from the electronic medical records by using a combined risk set sampling and propensity score matching) and longitudinal comparisons at two different, clinically relevant time points during the ICU stay: first, after admission, and second, at an event stage prior to death in non-survivors or a matching time point in survivors. While specific outcomes depend on the considered network estimation method and network difference characteristic, there are however some overarching observations. For instance, we overall discover relevant changes of organ system interactions in critically ill patients in the course of ICU treatment in that while the network structures at admission stage tend to look fairly similar among survivors and non-survivors, the corresponding networks at event stage differ substantially. In particular, organ system interactions appear to stabilize for survivors, while they do not or even deteriorate for non-survivors. Moreover, on an edge-specific level, a positive association between creatinine and C-reactive protein is typically present in all the considered networks except for the non-survivor networks at the event stage.

**DIFFERENT STATISTICAL STRATEGIES FOR THE ANALYSIS OF IN VIVO ALKALINE COMET ASSAY DATA**

Timur Tug^{1}, Annette Bitsch^{2}, Frank Bringezu^{3}, Steffi Chang^{4}, Julia Duda^{1}, Martina Dammann^{5}, Roland Frötschl^{6}, Volker Harm^{7}, Bernd-Wolfgang Igl^{8}, Marco Jarzombek^{13}, Rupert Kellner^{2}, Fabian Kriegel^{13}, Jasmin Lott^{8}, Stefan Pfuhler^{9}, Ulla Plappert-Helbig^{10}, Markus Schulz^{4}, Lea Vaas^{7}, Marie Vasquez^{12}, Dietmar Zellner^{8}, Christina Ziemann^{2}, Verena Ziegler^{11}, Katja Ickstadt^{1}

^{1}Department of Statistics, TU Dortmund University, Dortmund, Germany; ^{2}Fraunhofer Institute for Toxicology and Experimental Medicine ITEM, Hannover, Germany; ^{3}Merck KGaA, Biopharma and Non-Clinical Safety, Darmstadt, Germany; ^{4}ICCR-Roßdorf GmbH, Rossdorf, Germany; ^{5}BASF SE, Ludwigshafen am Rhein, Germany; ^{6}Federal Institute for Drugs and Medical Devices (BfArM), Bonn, Germany; ^{7}Bayer AG, Berlin, Germany; ^{8}Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany; ^{9}Procter & Gamble, Cincinnati, Ohio, USA; ^{10}Lörrach, Germany; ^{11}Bayer AG, Wuppertal, Germany; ^{12}Helix3 Inc, Morrisville, NC, USA; ^{13}NUVISAN ICB GmbH, Preclinical Compound Profiling, Germany

The in vivo alkaline Comet or single cell gel electrophoresis assay is a standard test in genetic toxicology for measuring DNA damage and repair at the level of individual cells. It is a sensitive, fast and simple method for detecting single or double strand breaks and is therefore widely used in several regulatory frameworks today. In 2016, several nonclinical statisticians and toxicologists from academia, industry and one regulatory body founded a working group “Statistics” within the “Gesellschaft für Umwelt-Mutationsforschung e.V.” (GUM). So far, this interdisciplinary group has collected data from more than 200 experiments performed in various companies to take a closer look at various aspects of the statistical analysis of Comet data.

In this work, we will sketch the assay itself and the related data processing strategies. Moreover, we will briefly describe the effect of different summarizing techniques for transferring data from the cell level to the slide or animal level, which might influence the final outcome of the test dramatically. Finally, we will present results of various inferential statistical models, including comparisons between them, with a special focus on the involvement of historical control data.
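How much the choice of summary can matter is easy to see on a toy example. The sketch below compares three ways of moving from cell-level values to one value per animal; the endpoint (assumed here to be % tail DNA), the three summaries and the numbers are illustrative assumptions and not the working group's data or recommended procedure.

```python
from statistics import mean, median


def mean_of_slide_medians(slides):
    """Median over the cells of each slide, then mean over slides."""
    return mean(median(cells) for cells in slides)


def mean_of_slide_means(slides):
    """Mean over the cells of each slide, then mean over slides."""
    return mean(mean(cells) for cells in slides)


def pooled_median(slides):
    """Median over all cells of the animal, ignoring the slide structure."""
    return median(cell for cells in slides for cell in cells)


# Hypothetical animal with two slides; one cell is heavily damaged.
animal = [[1, 2, 30], [4, 5, 6]]
print(mean_of_slide_medians(animal))  # 3.5
print(mean_of_slide_means(animal))    # 8
print(pooled_median(animal))          # 4.5
```

A single extreme cell moves the mean-based summary far more than the median-based ones, so the subsequent inferential model sees quite different animal-level values.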

**Meta-Cox-regression in DataSHIELD – Federated time-to-event-analysis under data protection constraints**

Ghislain N. Sofack^{1}, Daniela Zöller^{1}, Saskia Kiefer^{1}, Denis Gebele^{1}, Sebastian Fähndrich^{2}, Friedrich Kadgien^{2}, Dennis Hasenpflug^{3}

^{1}Institut für Medizinische Biometrie und Statistik (IMBI), Universität Freiburg; ^{2}Department Innere Medizin, Universitätsklinikum Freiburg; ^{3}Datenintegrationszentrum, Philipps-Universität Marburg

Introduction/Background

Studies published so far suggest that Chronic Obstructive Pulmonary Disease (COPD) may be associated with higher rates of mortality in patients with coronavirus disease 2019 (COVID-19). However, the number of cases at a single site is often rather small, making statistical analysis challenging. To address this problem, the data from several sites from the MIRACUM consortium shall be combined. Due to the sensitivity of individual-level data, ethical and practical considerations related to data transmission, and institutional policies, individual-level data cannot be shared. As an alternative, the DataSHIELD framework based on the statistical programming language R can be used. Here, the individual-level data remain within each site and only anonymous aggregated data are shared.

Problem statement

Up to now, no time-to-event analysis methods are implemented in DataSHIELD. We aim at implementing a meta-regression approach based on the Cox-model in DataSHIELD where only anonymous aggregated data are shared, while simultaneously allowing for explorative, interactive modelling. The approach will be exemplarily applied to explore differences in survival between COVID-19 patients with and those without COPD.

Methods

First, we present the development of a server-side and a client-side DataSHIELD package for calculating survival objects and fitting the Cox proportional hazards regression model to individual data at each site. The sensitive patient-level data stored on each server will be processed locally in RStudio, and only less sensitive intermediate statistics, such as the coefficient matrices and the variance-covariance matrices, are exchanged and combined via Study Level Meta-Analysis (SLMA) regression techniques to obtain a global analysis. We will demonstrate the process of evaluating the output of the local Cox regressions for data protection breaches. As an example, we will show the results of comparing the survival of COVID-19 patients with and without COPD using the COVID-19 data distributed across different sites of the MIRACUM consortium.
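The combination step can be illustrated with a fixed-effect inverse-variance meta-analysis of a single coefficient. This is a simplified sketch, not DataSHIELD code: it assumes that each site returns only the estimated log hazard ratio for COPD and its standard error, whereas the approach described above exchanges full coefficient and variance-covariance matrices. The function name and the numbers are hypothetical.

```python
from math import exp, sqrt


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of site-level estimates and its SE."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical site-level log hazard ratios and standard errors from three sites.
log_hr = [0.35, 0.10, 0.42]
se = [0.20, 0.30, 0.25]
pooled, pooled_se = pool_fixed_effect(log_hr, se)
print(exp(pooled))                      # pooled hazard ratio
print(exp(pooled - 1.96 * pooled_se),   # approximate 95% confidence limits
      exp(pooled + 1.96 * pooled_se))
```

Only two numbers per site cross the institutional boundary in this sketch, which is what makes the study-level approach attractive when individual-level data cannot be shared.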

Summary

In conclusion, we provide an implementation of SLMA Cox regression in the DataSHIELD framework to enable explorative and interactive modelling for distributed survival data under data protection constraints. We demonstrate its applicability using data from the MIRACUM consortium. By demonstrating the process of evaluating the output of the Cox regression for data protection breaches, we raise awareness of the problem.

**Quantification of severity of alcohol harms from others’ drinking items using item response theory (IRT)**

Ulrike Grittner^{1}, Kim Bloomfield^{1,2,3,4}, Sandra Kuntsche^{5}, Sarah Callinan^{5}, Oliver Stanesby^{5}, Gerhard Gmel^{6,7,8,9}

^{1}Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, Germany; ^{2}Centre for Alcohol and Drug Research, Aarhus University, Denmark; ^{3}Research Unit for Health Promotion, University of Southern Denmark, Denmark; ^{4}Alcohol Research Group, Emeryville, CA, USA; ^{5}Centre for Alcohol Policy Research, La Trobe University, Melbourne, Australia; ^{6}Alcohol Treatment Centre, Lausanne University Hospital CHUV, Lausanne, Switzerland; ^{7}Addiction Switzerland, Research Department, Lausanne, Switzerland; ^{8}Centre for Addiction and Mental Health, Institute for Mental Health Policy Research, Toronto, Ontario, Canada; ^{9}University of the West of England, Faculty of Health and Applied Science, Bristol, United Kingdom

Background: Others’ heavy drinking might negatively affect quality of life, mental and physical health, and work and family situation. However, little is known so far about which of these experiences are seen as most or least harmful, and who is most affected.

Methods: Data stem from large population-based surveys from 10 countries of the GENAHTO project (GENAHTO: Gender & Alcohol’s Harms to Others, www.genahto.org). Questions about harms from others’ heavy drinking concern verbal and physical harm, damage to clothing, belongings or property, traffic accidents, harassment, threatening behaviour, family problems, problems with friends, problems at work, and financial problems. We used item response theory (IRT) methods (two-parameter logistic (2PL) model) to scale the aforementioned items for each country separately. To acknowledge culturally related sensitivities to experiences of harm in different countries, we also used differential item functioning (DIF). This resulted in country-wise standardised person parameters for each individual in each country, providing a quantified measure of the load of alcohol’s harms to others (AHTO). In multiple linear mixed models (random intercept for country) we analysed how the load of AHTO was related to sex, age, own drinking and education.
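For readers unfamiliar with the 2PL model, its item response function can be written down directly. The sketch below uses the standard parametrisation with a discrimination parameter a and a severity (difficulty) parameter b; the function name and the item parameter values are hypothetical and not estimates from the GENAHTO data.

```python
from math import exp


def prob_report_harm(theta, a, b):
    """2PL model: probability of endorsing an item given the latent load theta.

    a is the item discrimination, b the item severity (location on the theta scale).
    """
    return 1 / (1 + exp(-a * (theta - b)))


# Hypothetical items: a less severe harm (b = 0.5) and a more severe one (b = 2.0).
for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta,
          round(prob_report_harm(theta, a=1.5, b=0.5), 3),
          round(prob_report_harm(theta, a=1.5, b=2.0), 3))
```

An item with a larger b is endorsed with probability 0.5 only at a higher latent load, which is how the model orders the harm items by severity.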

Results: Younger age, female sex and higher level of own drinking were related to a higher load of AHTO. However, interaction of age and own drinking indicated that only for younger age did own drinking level play a role.

Conclusions: Using IRT, we were able to evaluate differing grades of severity in the experiences of harm from others’ heavy drinking.

**A note on Rogan-Gladen estimate of the prevalence**

Barbora Kessel, Berit Lange

*Helmholtz Zentrum für Infektionsforschung, Germany*

When estimating prevalence from data obtained with an imperfect diagnostic test, an adjustment for the sensitivity and specificity of the test is desired. The test characteristics are usually determined in a validation study and are known only with some uncertainty, which should be accounted for as well. The classical Rogan-Gladen correction [4] comes with an approximate confidence interval based on normality and the delta method. However, in the literature it was found to have lower than nominal coverage when prevalence is low and both sensitivity and specificity are close to 1 [2]. In a recent simulation study [1], the empirical coverage of a nominal 95% Rogan-Gladen confidence interval was mostly below 90% and as low as 70% over a wide range of setups. These results are much worse than those reported in [2] and would make the Rogan-Gladen interval unsuitable for practical use. Since we are interested in applying the Rogan-Gladen method to estimate the seroprevalence of SARS-CoV-2 infections, as was done e.g. in [5], we will present detailed simulation results clarifying the properties of the Rogan-Gladen method in setups with low true prevalences and high specificities, as seen in current seroprevalence studies of SARS-CoV-2 infections (see e.g. the overview [3]). We will also take into account that in actual studies the final estimate is often a weighted average of prevalences in subgroups of the population. We will make recommendations on when the modification of the procedure suggested by Lang and Reiczigel [2] is necessary. To conclude, we would like to note that, since the uncertainties in the sensitivity and specificity estimates used for the correction influence the uncertainty of the corrected prevalence, it is highly desirable always to state not only the values of the test characteristics but also their uncertainties. Reporting the crude uncorrected prevalences as well facilitates future re-use of the results, e.g. in meta-analyses.
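For orientation, the classical correction and its delta-method interval can be sketched as follows. The point estimate is (p_obs + Sp - 1) / (Se + Sp - 1), and the variance combines the binomial uncertainty of the crude prevalence with that of the sensitivity and specificity estimates from independent validation samples. The function name, the argument names and the example numbers are hypothetical, and the interval is deliberately left untruncated.

```python
from math import sqrt
from statistics import NormalDist


def rogan_gladen(p_obs, n, sens, n_sens, spec, n_spec, level=0.95):
    """Rogan-Gladen prevalence estimate with a delta-method (Wald) interval.

    p_obs: crude proportion of positive tests among n tested persons
    sens, spec: estimates from validation samples of size n_sens and n_spec
    """
    youden = sens + spec - 1
    prev = (p_obs + spec - 1) / youden
    var = (p_obs * (1 - p_obs) / n
           + prev ** 2 * sens * (1 - sens) / n_sens
           + (1 - prev) ** 2 * spec * (1 - spec) / n_spec) / youden ** 2
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(var)
    return prev, prev - half_width, prev + half_width


# Hypothetical low-prevalence setting with a highly specific test.
print(rogan_gladen(p_obs=0.05, n=2000, sens=0.90, n_sens=200, spec=0.98, n_spec=500))
```

In low-prevalence settings such as this one, the lower limit can come close to or fall below zero, which is one symptom of the coverage problems discussed above.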

[1] Flor M, Weiß M, Selhorst T et al (2020). BMC Public Health 20:1135, doi: 10.1186/s12889-020-09177-4

[2] Lang Z, Reiczigel J. (2014). Preventive Veterinary Medicine 113, pp. 13–22, doi: 10.1016/j.prevetmed.2013.09.015

[3] Neuhauser H, Thamm R, Buttmann-Schweiger N et al. (2020). Epid Bull 50, pp. 3–6; doi: 10.25646/7728

[4] Rogan WJ, Gladen B (1978). American Journal of Epidemiology 107(1), pp. 71–76, doi: 10.1093/oxfordjournals.aje.a112510

[5] Santos-Hövener C, Neuhauser HK, Schaffrath Rosario A et al. (2020). Euro Surveill 25(47):pii=2001752, doi: 10.2807/1560-7917.ES.2020.25.47.2001752

**On variance estimation for the one-sample log-rank test**

Moritz Fabian Danzer, Andreas Faldum, Rene Schmidt

*Institute of Biostatistics and Clinical Research, University of Münster, Germany*

Time-to-event endpoints are increasingly popular in phase II cancer trials. The standard statistical tool for such endpoints in one-armed trials is the one-sample log-rank test. It is widely known that the asymptotics justifying this test do not fully take effect for small sample sizes. There have already been some attempts to solve this problem. While some do not allow easy power and sample size calculations, others lack a clear theoretical motivation and require further considerations. The problem itself can partly be attributed to the dependence between the compensated counting process and its variance estimator. We provide a framework in which the variance estimator can be flexibly adapted to the situation at hand while maintaining its asymptotic properties. As an example, we suggest a variance estimator that is uncorrelated with the compensated counting process. Furthermore, we provide sample size and power calculations for any approach fitting into our framework. Finally, we compare several methods via simulation studies and the hypothetical setup of a phase II trial based on real-world data.
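As background, the classical form of the one-sample log-rank statistic can be sketched as follows. It compares the observed number of events O with the number E expected under a reference cumulative hazard and uses E itself as the variance estimate. This is the standard version, not the alternative variance estimator proposed in the abstract, and the function name, reference hazard and data are hypothetical.

```python
from math import sqrt


def one_sample_logrank(times, events, cum_hazard0):
    """Classical one-sample log-rank test statistic.

    times: follow-up time of each patient
    events: 1 if the event was observed, 0 if censored
    cum_hazard0: reference cumulative hazard function under the null hypothesis
    """
    observed = sum(events)
    expected = sum(cum_hazard0(t) for t in times)
    z = (observed - expected) / sqrt(expected)
    return observed, expected, z


# Hypothetical trial: exponential reference with hazard 0.2 per year.
reference = lambda t: 0.2 * t
times = [0.8, 1.5, 2.0, 2.4, 3.1, 3.5, 4.0, 4.0]
events = [1, 0, 1, 0, 0, 1, 0, 0]
print(one_sample_logrank(times, events, reference))
```

The observed count O appears both in the numerator and, through the shared follow-up times, in the variance estimate E; this dependence is the starting point for the alternative variance estimators discussed above.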

**Visualizing uncertainty in diagnostic accuracy studies using comparison regions**

Werner Vach^{1}, Maren Eckert^{2}

^{1}Basel Academy for Quality and Research in Medicine, Switzerland; ^{2}Institute of Medical Biometry and Statistics, University of Freiburg, Germany

The analysis of diagnostic accuracy can often be seen as a two-dimensional estimation problem. The interest is in pairs such as sensitivity and specificity, positive and negative predictive value, or positive and negative likelihood ratio. For visualizing the joint uncertainty in the two-parameter estimate, confidence regions are an obvious choice.

However, Eckert and Vach (2020) recently pointed out that this is a suboptimal approach. Two-dimensional confidence regions support the post-hoc testing of point hypotheses, whereas the evaluation of diagnostic accuracy is related to testing hypotheses on linear combinations of parameters (Vach et al. 2012). Consequently, Eckert and Vach suggest the use of comparison regions, which support such post-hoc tests.

In this poster we illustrate the use of comparison regions in visualizing uncertainty using the results of a published paired diagnostic accuracy study (Ng et al 2008) and contrast it with the use of confidence regions. Both LR-test based and Wald-test based regions are considered. The regions are supplemented by (reference) lines that allow judging possible statements about certain weighted averages of the parameters of interest. We consider the change in sensitivity and specificity as well as the change in the relative frequency of true positive and false positive test results. As the prevalence of the disease state of interest is low in this study, the two approaches give very different results.

Finally, we give some recommendations on the use of comparison regions in analysing diagnostic accuracy studies.

References:

Eckert M, Vach W. On the use of comparison regions in visualizing stochastic uncertainty in some two‐parameter estimation problems. Biometrical Journal. 2020; 62: 598–609.

Vach W, Gerke O, Høilund-Carlsen PF. Three principles to define the success of a diagnostic study could be identified. J Clin Epidemiol. 2012; 65:293-300.

Ng SH, Chan SC, Liao CT, Chang JT, Ko SF, Wang HM, Chin SC, Lin CY, Huang SF, Yen TC. Distant metastases and synchronous second primary tumors in patients with newly diagnosed oropharyngeal and hypopharyngeal carcinomas: evaluation of (18)F-FDG PET and extended-field multi-detector row CT. Neuroradiology 2008; 50:969-79.