In Conversation with Jonathan Dursi, CanDIG Project & SickKids

Jonathan Dursi on the launch of the first-of-its-kind CINECA project, democratizing health data across continents

The Common Infrastructure for National Cohorts in Europe, Canada and Africa (CINECA) allows unprecedented insight by enabling search and analysis of the genomic and clinical health data of 1.4 million people. CINECA is a federated network of partners across three continents. The platform unlocks the potential to study rare diseases and genomic variants. It creates an interoperable health data system to accelerate research and advance benefits to patients on an extraordinary scale.

Launched in January 2019, the CINECA project was developed in partnership with 19 organizations, including the Canadian Distributed Infrastructure for Genomics (CanDIG) and The Hospital for Sick Children (SickKids), which created the federated system for data sharing. We talked to Jonathan Dursi, Technical Lead of the CanDIG project and Senior Research Associate in the Centre for Computational Medicine at SickKids.


What led to the development of CINECA?

Biology is very complicated. Researchers need to be able to look at the combined data from a large number of human subjects to highlight signals and filter out noise in genomic or health data. For each person, we need to consider both their clinical data and their genomic data to understand what exacerbates a condition. We need lots of data to understand these things, and amassing that data is well beyond what any single hospital can do.

We are starting to understand that it is not just the mass of data, the sample size, but also the diversity of people that is crucial to understanding biology and medicine. That requires more data than a single country can provide. For that reason, we launched the CINECA project in collaboration with a number of institutions around the world. CINECA allows authorized researchers to perform analyses on a cohort of patients from across countries to answer basic questions about biology and medicine. Authorized clinicians can look at genomic variants to identify what might cause a disease in one patient. Clinicians can then compare that patient’s results with those of a large number of healthy people from different backgrounds across the world. This helps establish whether a particular variant is causing the disease. We can also try to identify whether drug targeting can be explored to help that patient.

How was CanDIG involved in this collaboration?

CINECA is about bringing cohorts of people together so that researchers can run their analyses on the data for those cohorts and combine the answers to produce something that is more than just the sum of smaller studies. To achieve this, we need technical systems in place, so CanDIG, the European Genome-phenome Archive (EGA) and the European Life-Sciences Infrastructure for Biological Information (ELIXIR) need to be interoperable. More importantly, they also need to be federated, because we cannot technically or legally bring all this data together in one place.

CanDIG, a project run out of HPC4Health with Michael Brudno and Carl Virtanen as the principal investigators (PIs), has extensive experience in federating access to health data. With security and privacy so important now, and rightly so, and with the scale of the data coming in, Europe is at the point where it needs to federate access to its data, even internally within the EU. CanDIG joined the CINECA project as a peer of EGA and ELIXIR, the two big European bioinformatics consortia, and leads the work on federated access queries, which form one of the technical components of the CINECA project.


Can you elaborate a bit further on how CanDIG supports this system?

On a technical level, we are interacting with other work groups on aspects like ensuring our authorization and authentication are interoperable. This includes making sure that a researcher can run bioinformatic analyses on “virtual cohorts” once they find patients or subjects of interest in this huge distributed data set. The researcher can then get answers without having to examine or even access individual data from each patient.

On the scientific side, our Canadian PIs are data stewards of important cohorts. Guillaume Bourque represents a number of Quebec-based cohorts; Fiona Brinkman, the lead Canadian PI of CINECA, is very interested in longitudinal data and is involved with the CHILD cohort, which is a longitudinal study of children. Canada is participating heavily in CINECA, and CanDIG is taking the lead in the areas where we have built expertise, such as interoperability, federation and distributed access to data, while allowing local data stewards to keep control of their data even when they make it available for analysis.


Why do we need access to so much data?

It gives us scale and diversity. Canadian efforts in genomic data collection are growing in ambition, but more is needed to keep advancing research. We only have a modest number of patients who have large amounts of their genomic and clinical health data available, and they are certainly not evenly distributed across the country. Additionally, incorporation of clinical health data is still in its early stages in national projects, so the CINECA project can be instrumental in changing that.

The clearest advantage for clinicians is determining who else has a patient who has the same genomic variant in a particular gene or who has experienced this sort of disease. In both cases, we can be talking about very small subsets of the world’s population.


Collaborating across three continents to develop a federated system of sharing health data is very ambitious. How did this collaboration happen, and what resources made it possible?

CINECA is a joint project between the EU and the Canadian Institutes of Health Research (CIHR). Our Canadian PIs on this project have long-established and strong international networks and collaborations with several participating groups. Further, the EGA and CanDIG are driver projects for the Global Alliance for Genomics and Health (GA4GH). The GA4GH is the body that sets standards for building interoperable data models and Application Programming Interfaces (APIs) for responsible access to genomics and health information. ELIXIR participates in GA4GH as a strategic partner.

The EGA, ELIXIR, and CanDIG already had a year or so of experience interacting with each other through these forums. We had started to build some of the technical pieces that support projects like this one. An important part of the call for proposals we were answering with our CINECA proposal was about the enabling technology that would allow for this kind of intercontinental science. Our year-long experience of working together to build these pieces made it clear to us that we could reasonably propose something like this.


How do you think you can further the CINECA project?

This is a very ambitious project, and we are arranging it in such a way that all the technical work of this project will happen within the GA4GH standards process. We hope our efforts will continue to help build standards. It is important for us that CINECA does not end up as just a project running on a few systems at a few sites, but that it eventually becomes the interoperable standard that others will be able to join and build on.


Were there any roadblocks while trying to work on such an ambitious project?

The biggest roadblock is the willingness to share data. The data belongs to real people, and there is a genuine ethical concern about sharing it with researchers in other countries, even if they have the best intentions. In previous models, people could download the data they wanted to study, which can be problematic if someone’s device is lost or stolen. So, there was a lot of worry about losing control over sensitive patient data in ways that could result in privacy breaches.

CanDIG’s and GA4GH’s expertise in making data available for analysis without providing direct access to it made participating data stewards more comfortable. With this system, researchers can run their analyses and get only the results back. It’s the concept that we have been building over the last few years and this has made people more comfortable with the project.
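The "run the analysis, get only the results back" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CanDIG's or GA4GH's actual interface: the `Site` class, the `federated_count` function, the site names and the toy records are all hypothetical, and a real deployment would add authentication, authorization and network calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record], bool]


@dataclass
class Site:
    """A hypothetical data steward; individual records stay inside the site."""
    name: str
    _records: List[Record] = field(default_factory=list, repr=False)

    def count_matching(self, predicate: Predicate) -> int:
        # The query runs where the data lives; only an aggregate count leaves.
        return sum(1 for record in self._records if predicate(record))


def federated_count(sites: List[Site], predicate: Predicate) -> Tuple[int, Dict[str, int]]:
    """Send the same query to every site and combine the per-site counts."""
    per_site = {site.name: site.count_matching(predicate) for site in sites}
    return sum(per_site.values()), per_site


# Toy data (invented for illustration only).
sites = [
    Site("site_a", [{"gene": "GENE_X", "variant": "v1"},
                    {"gene": "GENE_Y", "variant": "v2"}]),
    Site("site_b", [{"gene": "GENE_X", "variant": "v1"}]),
]


def has_variant(record: Record) -> bool:
    return record["gene"] == "GENE_X" and record["variant"] == "v1"


total, per_site = federated_count(sites, has_variant)
print(total, per_site)
```

The caller only ever sees counts; the records themselves are never returned, which is the property that lets local data stewards keep control of their data.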


Lastly, can you elaborate on how the CINECA project will affect future studies?

We are starting to understand that the world is moving towards API access, and what you don’t want is for everyone to build their own incompatible wheel, because it’s time-consuming and you cannot create anything bigger than that one project. By working through GA4GH as a standard-setting body, we will design and build systems that can hopefully be used by others and gradually extended as needed. We want this standardization not only to help the science projects that CINECA supports but also to help build a long-standing platform that others can take advantage of now and in the future.

This is a wildly ambitious project, which hopes to improve the lives of millions by changing the way we approach health care globally.  It can potentially improve the prevention and treatment of rare diseases through a better understanding of human biology.

Click here for the official press release.

Compute Ontario sincerely thanks Dr. Jonathan Dursi for this interview.