Glencoe Software, the Open Microscopy Environment, and other members of the bioimaging community recently published a paper in Nature Methods describing the motivations and specification for the "Next-Generation" of open image formats like OME-TIFF. Here, we highlight Glencoe Software's perspective on and ongoing work toward adopting and advancing Next-Generation File Formats (NGFF).
Why do so many file formats exist?
The majority of biological and biomedical imaging (collectively “bioimaging”) file formats are produced by microscope vendors and are understandably optimized for the acquisition write workflows of a particular instrument. Bio-Formats, a real-time image translation library developed and supported by both the Open Microscopy Environment and Glencoe Software, has been transformative in allowing cross-format workflows in a variety of image processing and visualization software. Format-specific readers are complex and must be maintained in perpetuity due to the reliance of the bioimaging community on these tools and the required longevity of data accessibility. Furthermore, working within the confines of an acquisition-only-optimized workflow hinders progress in adapting to novel information storage and processing architectures. This motivates the adoption of standardized, unified and cloud-native file formats for bioimaging data.
This is a critically important question. Why would the teams who have worked so hard to build Bio-Formats and ‘defeat’ the ever-increasing number of file formats, of which there are hundreds in bioimaging alone, come up with a new one? Despite the reality that OME-NGFF is yet another file format, its differences from other implementations provide some real advantages for future-proofing bioimaging workflows. Specifically, it has been designed from its inception to work with scalable, cloud-based data resources and for public or shared data repositories used for AI training and data publication.
Let’s describe what OME-NGFF is.
A new, open data format: There is a well-developed and active community, open specification, and liberally licensed readers, writers and converters (bioformats2raw, raw2ometiff, ZarrReader, ome-zarr-py). There are examples of using OME-NGFF for whole slide imaging (as in digital pathology), high content screening and 3D imaging of large tissue samples.
A format for cloud-native storage: As the volume and complexity of data grows, many of our customers (and the bioimaging community as a whole) are moving toward cloud-based solutions for storing, sharing and analyzing bioimaging data. The cost comparison for storing data in commercial cloud systems (e.g., Amazon Web Services, Microsoft Azure, Google Cloud Platform) versus local or institutional facilities depends on technology choices.
The most cost-effective cloud resources use a set of loosely defined technologies called “object”, “S3”, or “blob” storage (collectively “cloud-native storage”). These are relatively simple data recording and access resources that scale to petabytes. The catch is in the “relatively simple” part. Cloud-native storage eschews classical file system functionality like fine-grained permissions, and its architecture focuses on maximizing aggregate throughput over latency. Utilizing traditional file formats on such storage is extremely challenging for applications with high data complexity, such as scrolling through a timelapse sequence or a multi-dimensional pyramid, reading subsections at a time.
OME-NGFF makes it possible to use cloud-native storage for complex, multi-dimensional bioimaging data by breaking a large dataset into “chunks”, or small files that can be easily retrieved. A metadata specification allows any software to understand where to find the files and how to put the chunks back together.
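The chunking idea can be sketched in a few lines of Python. This is a minimal illustration, not the OME-NGFF reference implementation: the function name `chunk_keys_for_region`, the 512 x 512 chunk size and the slash-separated key layout are assumptions chosen for the example. It shows how a reader can work out which small files overlap a requested region.

```python
from itertools import product


def chunk_keys_for_region(region, chunk_shape, separator="/"):
    """Return the keys of every chunk that overlaps a region.

    region: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk size along each dimension.
    """
    index_ranges = []
    for (start, stop), size in zip(region, chunk_shape):
        first = start // size        # index of the chunk holding the first pixel
        last = (stop - 1) // size    # index of the chunk holding the last pixel
        index_ranges.append(range(first, last + 1))
    # One key per combination of per-dimension chunk indices.
    return [separator.join(str(i) for i in idx) for idx in product(*index_ranges)]


# Hypothetical 2D plane split into 512 x 512 chunks; the viewer
# wants rows 1000-1499 and columns 0-599.
keys = chunk_keys_for_region([(1000, 1500), (0, 600)], (512, 512))
print(keys)
```

Only the four chunks named by these keys need to be retrieved from storage; every other chunk of the image is left untouched.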
A new format for data streaming, enabling new ways of data sharing: This is perhaps the most important point. Much of the scientific community accesses data through download, in which data files are transferred onto a computer for further analysis. While this is acceptable for datasets ranging from megabytes to a few gigabytes, once datasets exceed 10 GB, network latency, transmission errors, and the limits of web browser technology make download of bioimaging data impractical. Because OME-NGFF breaks data into easily read pieces, a visualization or analysis application can access what it needs, when it needs it, without fully downloading very large datasets.
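To get a rough feel for the difference between streaming and download, the arithmetic can be sketched as follows. The image size, viewport and chunk size are hypothetical, compression is ignored, and `bytes_to_fetch` is an illustrative helper rather than part of any OME-NGFF tool.

```python
import math


def bytes_to_fetch(region, chunk_shape, bytes_per_pixel):
    """Upper bound on uncompressed bytes transferred when a region is read chunk by chunk."""
    n_chunks = 1
    for (start, stop), size in zip(region, chunk_shape):
        # Number of chunks touched along this dimension (stop is exclusive).
        n_chunks *= (stop - 1) // size - start // size + 1
    return n_chunks * math.prod(chunk_shape) * bytes_per_pixel


# Hypothetical 100,000 x 100,000 pixel, 8-bit, single-channel image.
full_size = 100_000 * 100_000 * 1

# A 2048 x 2048 viewport at an arbitrary offset, stored as 1024 x 1024 chunks.
viewport = bytes_to_fetch([(30_000, 32_048), (50_000, 52_048)], (1024, 1024), 1)

print(f"full download: {full_size} bytes")
print(f"viewport read: {viewport} bytes ({viewport / full_size:.4%} of the image)")
```

Under these assumptions the viewer transfers nine chunks, a small fraction of a percent of the full image, which is why chunked access remains practical where a full download does not.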
Let’s also be transparent on what OME-NGFF is not.
A completely novel data storage technology? Not really. In developing OME-NGFF, we have aligned our work with that of a broad cross-section of the data sciences community, including Pangeo, NASA, the UK Met Office and others. OME-NGFF is an adaptation of technology that is emerging across the natural and applied sciences and is a great example of communities working together to leverage each other’s work and experiences.
Proprietary to Glencoe Software or OME? Nope. There is an open specification, open source liberally licensed implementations (bioformats2raw, raw2ometiff, ZarrReader), and an active development community. Anyone can use it and we are happy to support your adoption.
A new technology that commercial imaging companies should immediately adopt? Probably not, although we think everyone in the community will at least want to keep abreast of developments in this space. During the current phase of OME-NGFF development (i.e., during 2022), we are prioritizing metadata support, compression, and implementations for many of the different bioimaging domains. Write-optimized implementations, i.e., software that prioritizes writing OME-NGFF as quickly as possible, are not yet available, so the prerequisites for full adoption by commercial imaging companies have not yet been met.
A replacement for other established open formats? Not really. We came up with OME-NGFF exactly because existing open image data formats like TIFF and HDF5 work very well in many traditional use cases, but are ill-suited for the use cases we repeatedly find most challenging for our customers and colleagues:
- Shared data resources for AI training and data publication;
- Cloud-based data resources (especially those that employ “object”, “S3” or other highly scalable data storage technologies).
OME-NGFF is a new format that is designed to serve and perform in new use cases where existing formats don’t work well.
“How to” NGFF?
Learn more. Want to learn more? A good place to start is the OME-NGFF paper. It is open access and contains background and benchmarking for multiple file formats and storage back-ends. For an even deeper dive into the details of file formats and rationale for OME-NGFF, see the Supplemental Note.
The particular technology selected for the implementation of OME-NGFF is Zarr, which is a format for storing chunked, compressed, N-dimensional arrays, and a technology applicable to numerous domains in natural and applied sciences. For this reason, you will see references to Zarr in OME-NGFF-relevant tooling and sample data.
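As a rough illustration of what "chunked, compressed, N-dimensional arrays" means in practice, the sketch below parses array metadata in the style of a Zarr version 2 `.zarray` file using only the Python standard library. The shape, chunk size and compressor settings are hypothetical values, not taken from any real dataset.

```python
import json
import math

# Example Zarr v2 array metadata (the content of a ".zarray" file).
# All values are hypothetical and chosen for illustration only.
zarray_text = """
{
  "zarr_format": 2,
  "shape": [3, 40000, 60000],
  "chunks": [1, 1024, 1024],
  "dtype": "<u2",
  "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
  "fill_value": 0,
  "order": "C",
  "filters": null,
  "dimension_separator": "/"
}
"""

meta = json.loads(zarray_text)

# Chunks at the array edge may be only partly filled, so round up per dimension.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]
total_chunks = math.prod(chunks_per_dim)

print("chunks per dimension:", chunks_per_dim)
print("total chunks:", total_chunks)
print("compressor:", meta["compressor"]["id"])
```

The metadata alone tells a reader how the array is laid out, so it can locate any chunk without scanning the data itself.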
View and analyze data on-line. Our colleagues at IDR maintain a public repository of OME-NGFF sample data from numerous sources. These are examples of how data can be shared on-line, either publicly or with colleagues using cloud-native data sources.
Glencoe Software hosts reference OME-NGFF data in S3 here: s3://gs-public-zarr-archive.
Generate OME-NGFF data of your own. Try bioformats2raw, a command line tool developed by Glencoe Software to convert any Bio-Formats supported file format to OME-NGFF.
bioformats2raw is open source and the latest release can be found here. For further reading on the motivations behind a high-performance conversion pipeline, please see our previous blog post here.
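For pipelines driven from Python, a conversion can be wrapped as in the sketch below. It assumes that bioformats2raw is installed and on the `PATH` and that it is invoked with an input path followed by an output path; the file names and the helper functions `build_convert_command` and `convert` are hypothetical. Check `bioformats2raw --help` for the options supported by your installed version.

```python
import shutil
import subprocess


def build_convert_command(source, output):
    """Build the argument list for a basic conversion (input path, then output path)."""
    return ["bioformats2raw", str(source), str(output)]


def convert(source, output):
    """Run the conversion, failing early if the tool is not installed."""
    if shutil.which("bioformats2raw") is None:
        raise FileNotFoundError("bioformats2raw was not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_convert_command(source, output), check=True)


# Hypothetical file names; this only prints the command that convert() would run.
cmd = build_convert_command("slide.svs", "slide.zarr")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.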
Want to integrate OME-NGFF into your custom analysis pipelines? See our example of accessing image data from an OME-NGFF dataset.
Where will OME-NGFF go from here?
Glencoe Software is committed to supporting OME-NGFF in our products, including OMERO Plus for data management and PathViewer for visualization and annotation of bioimaging data. PathViewer already supports the visualization of OME-NGFF image and label image data, and OMERO Plus will support the import and management of OME-NGFF image data by mid-2022.
To keep up with ongoing efforts, join the OME-NGFF group or follow the OME-NGFF tag on image.sc and join the regular OME-NGFF Community Calls announced there. The OME-NGFF roadmap is also publicly available on GitHub.
Check back here for future announcements of ongoing efforts in Next-Generation File Formats by Glencoe Software.