Modern science is a global practice. Academic collaborations, multinational corporations, and national labs are all settings in which science performed in a single location must be shared or published across the world. While the definition of data publishing differs among these entities, all of them face the complexity of sharing GByte- and TByte-sized imaging datasets along with essential metadata and analytics.
Regardless of the motivation or implementation, specific technology and expertise are required to deliver high-performance, secure, globally accessible, scalable systems for data sharing and publication. Glencoe’s team has been building public and private imaging data repositories since 2005 based on its OMERO Plus platform. In this report, we summarize some of the principles we use for constructing and running these systems for our academic and industrial partners.
Selected strategies and architectures
There are a number of options for publishing data with OMERO Plus, and the right strategy depends on considerations for both IT and Science. The flexibility of OMERO Plus allows us to adapt to the requirements of different use cases and domains, all based on a common, scalable platform. Here we outline some examples, including explanations of why an institution might choose one or another.
The OMERO Plus Platform
OMERO Plus is an enterprise image data management system that handles imaging data in more than 160 file formats, covering most imaging domains used in life sciences R&D. Based on the open source OMERO application from OME, OMERO Plus scales from GBytes to PBytes of imaging data and from individual labs to globally distributed organizations. OMERO Plus can be deployed on premises, in commercial clouds, or in hybrid environments. Users can access their multi-dimensional image data through a secure web browser connection or, for more advanced computational use, through a fully open, cross-platform API.
Making an OMERO Plus installation public necessarily exposes it to heavy traffic and potential security threats. Below we list possible configurations to support large-scale usage and secure access.
Using the existing browser client for data access
OMERO.web is a client of OMERO Plus that allows secure data management, visualization and sharing via the web browser. The users, groups and permissions in OMERO Plus can be used to share data with specific users, or even to publish data publicly, without any architectural changes. See the OMERO Plus documentation for further details.
To scale up OMERO.web specifically for data publishing, it is possible to run an additional OMERO.web instance on a dedicated, separate host. Potential advantages include scaling the public and internal clients independently and improving the security profile by separating the public OMERO.web application from systems connected to institutional network storage.
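As a rough illustration of how group membership and permission levels can gate visibility, the sketch below models a read-only group that includes a dedicated public account. It is a simplified toy model and not the OMERO Plus API: the `Group` class, the `can_view` function, and all user and group names are hypothetical, and real permission handling (for example, the additional rights of group owners and administrators) is more detailed.

```python
from dataclasses import dataclass

# Permission levels, named after the group types used in OMERO.
LEVELS = ("private", "read-only", "read-annotate", "read-write")


@dataclass(frozen=True)
class Group:
    """Hypothetical stand-in for a group; not an OMERO class."""
    name: str
    level: str
    members: frozenset

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown permission level: {self.level}")


def can_view(user: str, data_owner: str, group: Group) -> bool:
    """Return True if `user` may see data owned by `data_owner` in `group`."""
    if user not in group.members:
        return False  # non-members never see the group's data
    if group.level == "private":
        return user == data_owner  # simplified: only the data owner
    return True  # every other level lets members read each other's data


lab = Group("lab", "private", frozenset({"alice", "bob"}))
published = Group("published", "read-only", frozenset({"alice", "public-user"}))

print(can_view("public-user", "alice", published))  # True
print(can_view("public-user", "alice", lab))        # False
```

In this model, publishing amounts to placing data in a group that the public account can read, which is why no architectural change is needed.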
Dedicated data services for published data
While the above strategy of running a separate OMERO.web provides some control over scaling public and internal instances, the server-side data services provided by the OMERO Plus microservices (OMERO.ms) are often the best opportunity for tuning the application based on usage. Running both OMERO.web and OMERO.ms on one or more additional servers can therefore provide excellent separation of compute resources.
Dedicated OMERO Plus for published data
If complete isolation of the public environment is required, the best solution is to run a separate OMERO Plus instance. This is the most secure option, as all components, including storage, database, and compute, can be completely isolated. Users and groups may also be managed separately, and application configurations can differ from internal systems as needed.
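The trade-offs among the configurations described in this section can be summarized as a simple decision rule. The following is only a sketch of that reasoning: the `Strategy` enum, the `choose_strategy` function, and its three flags are hypothetical names introduced for illustration, and a real deployment decision would weigh many more factors.

```python
from enum import Enum


class Strategy(Enum):
    EXISTING_WEB = "publish through the existing OMERO.web"
    SEPARATE_WEB = "additional OMERO.web on a dedicated host"
    SEPARATE_WEB_AND_MS = "dedicated OMERO.web and OMERO.ms servers"
    SEPARATE_INSTANCE = "separate OMERO Plus instance"


def choose_strategy(*, full_isolation: bool = False,
                    tune_data_services: bool = False,
                    scale_web_separately: bool = False) -> Strategy:
    """Pick the least invasive option that satisfies the stated needs."""
    if full_isolation:
        # Storage, database, and compute all need to be separate.
        return Strategy.SEPARATE_INSTANCE
    if tune_data_services:
        # Server-side data services need their own compute resources.
        return Strategy.SEPARATE_WEB_AND_MS
    if scale_web_separately:
        # Only the browser client needs independent scaling.
        return Strategy.SEPARATE_WEB
    return Strategy.EXISTING_WEB


print(choose_strategy(tune_data_services=True).value)
```

The checks run from the most demanding requirement to the least, so a stricter need always takes precedence over a weaker one.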
While IT security teams frequently ask about “security” in the sense of malicious actors and data breaches, it is also critical to consider and implement the data-level security that determines what data should in fact be public. Researchers must inherently be the arbiters of what data should be published and when.
Data under management in OMERO Plus consists not only of images but also of image metadata, including textual, tabular and spatial annotations. OMERO Plus enables the separation and publication of subsets of data via a variety of routes, including web interfaces, command line tools, and APIs in popular programming languages. Finally, the OMERO Plus Permissions Model provides control over which users can actually publish data.
Data sanitization and licensing
A particular challenge in publishing data is the sanitization of Personally Identifiable Information (PII) via anonymization or de-identification. PII can exist within file names and original acquisition metadata, within an image itself, and within metadata added in a data management platform like OMERO Plus. Experimenter user names, institution or research group names, and study identifiers are commonly embedded within original files. Some acquisition systems embed similar PII directly in the acquired images, while others produce supplemental slide label and/or barcode images which may contain identifying information. Sanitizing data is a process that is inherently specific to the study and research group. Ideally, data publication and PII sanitization should be considered throughout the entire life cycle of the data, even prior to acquisition.
OMERO Plus uses Bio-Formats, an image translation library, with support for over 160 different file formats. With this awareness of original files and their metadata, we have built metadata sanitizing tools for numerous research groups. These tools allow for either anonymization (removing PII entirely) or de-identification (replacing PII with coded or non-identifying metadata).
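The difference between the two approaches can be shown on a plain metadata dictionary. This is a minimal sketch and not the sanitizing tools mentioned above: the field names in `PII_KEYS`, the functions `anonymize` and `deidentify`, and the use of a keyed hash (HMAC-SHA256) to derive codes are all assumptions made for illustration. It covers only key-value metadata, not file names, pixel data, or label images.

```python
import hashlib
import hmac

# Hypothetical field names that a study might treat as PII.
PII_KEYS = frozenset({"experimenter", "institution", "study_id"})


def anonymize(metadata: dict, pii_keys=PII_KEYS) -> dict:
    """Anonymization: drop PII fields entirely."""
    return {k: v for k, v in metadata.items() if k not in pii_keys}


def deidentify(metadata: dict, secret: bytes, pii_keys=PII_KEYS) -> dict:
    """De-identification: replace PII values with stable, coded values.

    The same input and secret always give the same code, so records
    stay linkable without revealing the original value.
    """
    result = {}
    for key, value in metadata.items():
        if key in pii_keys:
            message = f"{key}:{value}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
            result[key] = f"{key}-{digest[:12]}"
        else:
            result[key] = value
    return result


original = {"experimenter": "jdoe", "institution": "Example Lab",
            "objective": "40x"}

print(anonymize(original))  # {'objective': '40x'}
print(deidentify(original, secret=b"keep-this-key-private"))
```

Whoever holds the secret key can regenerate the codes from known values, so in this sketch the key would need to be stored as carefully as the original identifiers.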
Finally, particular attention should be paid to the metadata before data enters the public domain. The FAIR principles provide a series of authoritative guidelines on the metadata that should be made available for digital assets. For publishing specifically, it is essential to clarify the conditions of data re-use, e.g. by specifying the appropriate usage license (see FAIR principle R1.1). OMERO annotations provide a flexible way to express this information and to make the data compliant with the legal obligations associated with publication.
Glencoe Software is proud to contribute both technologies and expertise to make data publication possible for our academic and industry partners. Please contact us to discuss your use case in more detail.
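One way to picture license information expressed as an annotation is as a small set of key-value pairs that is checked for completeness before release. The sketch below works on plain Python tuples and does not call any OMERO API: the key names in `REQUIRED_KEYS` and the helper functions are hypothetical, and the Creative Commons license and copyright holder are example values only.

```python
# Hypothetical key names; a project would define its own convention.
REQUIRED_KEYS = ("License", "License URL", "Copyright")


def license_pairs(name: str, url: str, holder: str) -> list:
    """Build the key-value pairs that describe the terms of re-use."""
    return [("License", name), ("License URL", url), ("Copyright", holder)]


def missing_keys(pairs) -> list:
    """Return required keys that are absent or have an empty value."""
    present = {key for key, value in pairs if value.strip()}
    return [key for key in REQUIRED_KEYS if key not in present]


pairs = license_pairs(
    "CC BY 4.0",
    "https://creativecommons.org/licenses/by/4.0/",
    "Example Lab",
)

print(missing_keys(pairs))                        # []
print(missing_keys([("License", "CC BY 4.0")]))   # ['License URL', 'Copyright']
```

A check like `missing_keys` could run as a gate before publication, so that no dataset is released without explicit re-use terms.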