Supporting Large 3D Datasets in MBAT
We need to pick at least one data format for large 3D image volumes, meaning volumes larger than available RAM and larger than a 32-bit address space. This rules out some formats, most notably multipage ("3D") TIFF, which uses 32-bit internal offsets and, in practice, breaks for files larger than 2GB (the offsets themselves cannot address anything past 4GB). I haven't been able to determine whether Analyze 7.5 format fully supports files larger than 4GB, but I've been able to save a 4GB stack in this format and reload it without errors or obvious problems.
Datasets this large impose several constraints. Downloading an entire dataset is impractical for many use cases, as it can take several minutes even on a fast local network, and hours on an intercontinental or residential network. Reslicing a non-memory-resident dataset "on the fly" is impractical as well, taking many seconds or minutes unless the dataset is specially formatted (see NeuroTerrain and related work); such special formats generally aren't widely supported by other imaging applications, so we would need to provide (and support!) software to translate volumes between special and standard formats.
It would be nice if we could store and serve large data in a compressed format, but as far as I know, common compressed formats (.gz) don't provide a computationally cheap way to seek and retrieve an arbitrary uncompressed "chunk" of data from the middle. We could use a ZIP archive of individual compressed slices, but I'd want to see some performance tests before picking this as our format of choice. I'd like to pick an uncompressed format for our first release.
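As a starting point for those performance tests, here is a minimal sketch of the ZIP-of-slices idea using Python's standard zipfile module. The member naming scheme (slice_00005.raw) and the tiny 16 x 16 slices are assumptions made up for the illustration. Each slice is compressed as its own archive member, so fetching one slice only decompresses that member. Archives larger than 4GB need the ZIP64 extension, which zipfile allows by default.

```python
import io
import zipfile


def write_slice_archive(target, slices):
    """Store each 2D slice (raw bytes) as its own deflate-compressed member."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for z, data in enumerate(slices):
            # Hypothetical naming convention: zero-padded slice index.
            zf.writestr(f"slice_{z:05d}.raw", data)


def read_slice(zf, z):
    """Decompress only member z; the other members are not read."""
    return zf.read(f"slice_{z:05d}.raw")


# Demo on an in-memory archive: 8 slices of 16 x 16 8-bit voxels,
# where every voxel in slice z holds the value z.
slices = [bytes([z]) * (16 * 16) for z in range(8)]
buffer = io.BytesIO()
write_slice_archive(buffer, slices)

with zipfile.ZipFile(buffer) as archive:
    middle = read_slice(archive, 5)
```

This says nothing yet about speed; timing read_slice against a plain seek-and-read on realistic slice sizes would be the actual test.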
Use cases
We want to support, at a minimum, browsing 2D slices of a 3D volume in the volume's "native" orientation. Slices in the native orientation may be stored as individual 2D image files, or as contiguous blocks within a 3D image file. It's generally easy to load and display native slices quickly enough to do it as part of a direct-manipulation task, so the user can adjust a control and watch for the desired slice to appear -- this implies a total latency of well under one second to load an image.
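For the single-file case, loading a native slice reduces to one seek and one read. The sketch below assumes an uncompressed volume with slices stored back to back after an optional fixed-size header; the function name and parameters are made up for the illustration.

```python
import tempfile


def read_native_slice(f, z, width, height, bytes_per_voxel=1, header_bytes=0):
    """Return the raw bytes of x-y slice z from an open, uncompressed volume."""
    slice_bytes = width * height * bytes_per_voxel
    # Python integers do not overflow, so offsets past 2**32 are fine here.
    f.seek(header_bytes + z * slice_bytes)
    data = f.read(slice_bytes)
    if len(data) != slice_bytes:
        raise EOFError(f"slice {z} is past the end of the volume")
    return data


# Demo: a tiny 4 x 3 x 5 (width x height x depth) 8-bit volume
# in which every voxel of slice z holds the value z.
width, height, depth = 4, 3, 5
with tempfile.TemporaryFile() as volume:
    for z in range(depth):
        volume.write(bytes([z]) * (width * height))
    slice_3 = read_native_slice(volume, 3, width, height)
```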
We would like to support browsing along three orthogonal axes -- the native axis of a volume (x-y planes at the current z location), and the other two "natural" axes (y-z at the current x location and x-z at the current y location). A simple "brute-force" solution is to pre-generate volumes in the other two orientations, and then retrieve slices natively from the appropriate volume. This triples storage overhead, but would seem to minimize computation and bandwidth to memory, storage, and the network. It also necessitates a way to keep track of the multiple, related volumes -- possibly a simple naming convention, but preferably a metadata format like the .keg or .atlas file.
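To illustrate why pre-generated volumes are attractive, the sketch below reads the two non-native planes directly from a native-order volume and counts the seeks involved. It assumes 8-bit voxels, no header, and x varying fastest, then y, then z; the function names are hypothetical. A native x-y slice needs one seek, an x-z slice needs one seek per z position, and a y-z slice needs one seek per voxel.

```python
import io


def read_xz_slice(f, y, width, height, depth):
    """x-z plane at row y: one contiguous row per z, so `depth` seeks."""
    rows = []
    for z in range(depth):
        f.seek((z * height + y) * width)
        rows.append(f.read(width))
    return rows, depth


def read_yz_slice(f, x, width, height, depth):
    """y-z plane at column x: each voxel is isolated, so depth * height seeks."""
    rows = []
    for z in range(depth):
        row = bytearray()
        for y in range(height):
            f.seek((z * height + y) * width + x)
            row += f.read(1)
        rows.append(bytes(row))
    return rows, depth * height


# Demo: a 4 x 3 x 2 volume whose voxel values equal their linear index.
width, height, depth = 4, 3, 2
volume = io.BytesIO(bytes(range(width * height * depth)))
xz_rows, xz_seeks = read_xz_slice(volume, 1, width, height, depth)
yz_rows, yz_seeks = read_yz_slice(volume, 2, width, height, depth)
```

With a pre-generated volume in each orientation, all three cases collapse to the single-seek native read.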
We want to superimpose label information on large datasets. This implies that the atlas display system should be able to take advantage of large-data access methods, which implies architectural constraints on both the large-data model and the atlas viewer.
Implementation Details
Network vs. local file access
I anticipate that we will want to support both networked and local access to large datasets. I would prefer a design that isolates this issue as thoroughly as possible. At a high level of abstraction, we want to specify a dataset by name/location, request a slice from that dataset by orientation and position, and receive a 2D image representing that slice.
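One possible shape for that abstraction is sketched below. All names (Orientation, SliceSource, InMemorySource) are hypothetical, and the in-memory class is only a toy stand-in for real local-file and network implementations; it handles native orientation only to keep the sketch short.

```python
from abc import ABC, abstractmethod
from enum import Enum


class Orientation(Enum):
    XY = "xy"  # native orientation
    YZ = "yz"
    XZ = "xz"


class SliceSource(ABC):
    """Hypothetical interface: a dataset that hands out 2D slices on request."""

    @abstractmethod
    def get_slice(self, orientation, position):
        """Return (width, height, raw_bytes) for the requested slice."""


class InMemorySource(SliceSource):
    """Toy implementation over a bytes object of 8-bit voxels."""

    def __init__(self, width, height, depth, voxels):
        self.width, self.height, self.depth = width, height, depth
        self.voxels = voxels

    def get_slice(self, orientation, position):
        if orientation is not Orientation.XY:
            raise NotImplementedError("only native slices in this sketch")
        if not 0 <= position < self.depth:
            raise IndexError(position)
        n = self.width * self.height
        return self.width, self.height, self.voxels[position * n:(position + 1) * n]


# Demo: a 2 x 2 x 3 volume whose voxel values equal their linear index.
source = InMemorySource(2, 2, 3, bytes(range(12)))
w, h, data = source.get_slice(Orientation.XY, 1)
```

A caller written against SliceSource would not need to know whether the bytes came from a local file or a network service.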
Obviously, NeuroTerrain already provides this for network data. I don't want to re-invent significant portions of the NeuroTerrain protocol! We may, however, want to define a subset of NeuroTerrain's functionality that would be easier to support for "dumb" data formats and protocols -- say, only supporting three orthogonal orientations (or a single native orientation), and only native resolution.
One large file vs lots of small files
If we represent a large 3D volume as a series of files (typically one per slice), it's easy to "manually" select files representing a subset of the whole volume, or open the volume in sections, or split it across small removable media. This also happens to be the native format for CIVM volume data, which makes it convenient for us. However, opening and closing a file entails more overhead than seeking within an already-open file (see below), and moving and copying large collections of files is less convenient than manipulating single files.
If we use a single-file (or file-plus-header) representation, it becomes harder to manually explore subsets of the volume outside of MBAT, but I don't think that should constitute an MBAT design constraint. It's potentially more efficient to work with a single file (see below).
Persistent channel vs. open/close for each new slice
MBAT's current image-reading model opens a file or network connection, loads the image into RAM, and closes the connection. So does ImageJ, and so do most other programs, at least by default.
Some "virtual" access methods, like ImageJ's Virtual Stack facility, do the same thing on a slice-by-slice basis -- to get a slice (or a row or a single voxel), you open the corresponding file, read it (in its entirety) into RAM, and close it. This fails miserably for operations that cross planes, as you're potentially doing N^2 open/read/close cycles to get an N*N slice. It can even cause problems for native-orientation browsing, though, because ImageJ's event-handling loop blocks on slice loading while it tracks the Z-axis scroll bar. If you're accessing a network service or file system that takes more than a few hundred milliseconds to open, read and close a file, navigation becomes difficult. In a network environment, it's very hard to meet that constraint. (When I did performance tests on SRB, opening a file imposed a 4- to 8-second MCAT delay.)
If we instead persist a connection -- keep a file open while we're browsing it, or keep a network connection alive -- we can get big latency improvements, which means big user-experience improvements. I'd like very much for our APIs to support this mode of operation.
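The difference between the two modes can be sketched by counting opens. The example below assumes a raw single-file volume with a made-up slice size and uses a hypothetical CountingOpener wrapper; it browses ten slices both ways and records how often the file was opened.

```python
import os
import tempfile

SLICE_BYTES = 64  # hypothetical 8 x 8 slice of 8-bit voxels


class CountingOpener:
    """Wraps open() so the number of opens can be counted."""

    def __init__(self):
        self.opens = 0

    def __call__(self, path):
        self.opens += 1
        return open(path, "rb")


def browse_reopening(path, positions, opener):
    """Open, seek, read and close for every slice."""
    out = []
    for z in positions:
        with opener(path) as f:
            f.seek(z * SLICE_BYTES)
            out.append(f.read(SLICE_BYTES))
    return out


def browse_persistent(path, positions, opener):
    """Open once, then only seek and read while browsing."""
    out = []
    with opener(path) as f:
        for z in positions:
            f.seek(z * SLICE_BYTES)
            out.append(f.read(SLICE_BYTES))
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "volume.raw")
    with open(path, "wb") as f:
        for z in range(10):
            f.write(bytes([z]) * SLICE_BYTES)
    reopening, persistent = CountingOpener(), CountingOpener()
    reopened_slices = browse_reopening(path, range(10), reopening)
    persistent_slices = browse_persistent(path, range(10), persistent)
```

Both functions return the same data. On a local disk the extra opens cost little, but with a per-open delay of several seconds, like the SRB figure quoted above, ten opens versus one is the difference that matters.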
Previous work, and lessons learned
Shiva and previous versions of MBAT offered large-data support, including support for CIVM large datasets. As best I could determine, this worked by downsampling data as it was loaded. This didn't work out well for our needs -- the initial opening was very, very slow (slower than simply reading the entire dataset), the downsampled data wasn't interpolated for smoothness, and there was no way to "zoom in" to get a full-resolution view. Getting higher-resolution images is one of CIVM's main reasons for existing, and if a tool can't show us our data at full resolution, it isn't useful to us. Mandatory downsampling is not an option.
The CIVM lab database serves up TIFF slices in native orientation "on demand" over a network connection. It uses ImageJ as a viewer, with a plug-in that extends ImageJ's virtual-stack code. It works okay on our LAN, but it's painful over remote connections because of how ImageJ interleaves image-load, display refresh, and navigation UI actions. I added a facility to cache a number of slices in RAM; this helps a little. Were I to start over, I'd want to re-engineer the event loop for stack navigation, and I'd consider loading a compressed image format (JPEG) during navigation, replacing it with an uncompressed TIFF when navigation pauses.
VoxStation, a commercial LIMS we use at CIVM, serves up JPEG or TIFF images in native or orthogonal orientations. It precomputes orthogonally resliced stacks, essentially keeping each stack in triplicate; it also precomputes JPEG image sequences for quicker transfer. Performance is outstanding on a LAN, and quite usable even over slower and higher-latency remote connections. The viewer, also based on ImageJ, caches and prefetches data aggressively.
Conclusions (for the moment)
The large-data viewer plugin should be able to display both local data (from a local file) and remote data (from a network service). If our high-level interface accepts an orientation and an offset, and returns a data block representing a 2D image, we can provide implementations for both access methods. At their simplest, these methods could just compute an offset into a file, grab one slice worth of data, and return it. We'll need to figure out how to negotiate things like voxel depth, byte-ordering and so forth.
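Whatever the negotiation ends up looking like, the receiving side needs the voxel type and byte order before it can interpret a slice. The sketch below shows the decoding step using Python's struct module; the type-name table and the function signature are assumptions for the illustration, not a proposed wire format.

```python
import struct

# Hypothetical voxel-type table: name -> struct format character.
VOXEL_FORMATS = {"uint8": "B", "int16": "h", "uint16": "H", "float32": "f"}


def decode_slice(raw, width, height, voxel_type, big_endian):
    """Turn one slice of raw bytes into a list of rows of Python numbers."""
    code = VOXEL_FORMATS[voxel_type]
    fmt = (">" if big_endian else "<") + str(width * height) + code
    if len(raw) != struct.calcsize(fmt):
        raise ValueError("slice size does not match the declared geometry")
    values = struct.unpack(fmt, raw)
    return [list(values[r * width:(r + 1) * width]) for r in range(height)]


# Demo: the same eight bytes read as a 2 x 2 slice of 16-bit voxels.
raw = bytes([0x01, 0x00, 0x00, 0x02, 0x01, 0x02, 0xFF, 0xFF])
as_big = decode_slice(raw, 2, 2, "uint16", big_endian=True)
as_little = decode_slice(raw, 2, 2, "uint16", big_endian=False)
as_signed = decode_slice(raw, 2, 2, "int16", big_endian=True)
```

The same bytes give three different images depending on the declared type and byte order, which is why these parameters have to travel with the data.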
We can gain performance benefits if we support persistent connections, either to an open file or to a network service. Adding a simple state component (open, closed, broken) to the interface shouldn't be prohibitive.
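A state component along those lines might look like the sketch below. SliceChannel and ChannelState are hypothetical names; the opener is any zero-argument callable that returns a seekable file-like object, so the same wrapper could sit over a local file or a network stream. An I/O failure moves the channel to the broken state so callers can decide whether to reopen.

```python
import io
from enum import Enum


class ChannelState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    BROKEN = "broken"


class SliceChannel:
    """Hypothetical persistent channel that tracks its own state."""

    def __init__(self, opener, slice_bytes):
        self._opener = opener
        self._slice_bytes = slice_bytes
        self._stream = None
        self.state = ChannelState.CLOSED

    def open(self):
        try:
            self._stream = self._opener()
        except OSError:
            self.state = ChannelState.BROKEN
            raise
        self.state = ChannelState.OPEN

    def read_slice(self, z):
        if self.state is not ChannelState.OPEN:
            raise RuntimeError(f"channel is {self.state.value}")
        try:
            self._stream.seek(z * self._slice_bytes)
            return self._stream.read(self._slice_bytes)
        except (OSError, ValueError):
            # ValueError covers a stream that was closed underneath us.
            self.state = ChannelState.BROKEN
            raise

    def close(self):
        if self._stream is not None:
            self._stream.close()
            self._stream = None
        self.state = ChannelState.CLOSED


# Demo: read a slice, then simulate the connection dropping.
backing = io.BytesIO(bytes(range(12)))
channel = SliceChannel(lambda: backing, slice_bytes=4)
channel.open()
first = channel.read_slice(1)
backing.close()
try:
    channel.read_slice(0)
except ValueError:
    pass
state_after_failure = channel.state
```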
We will support Analyze and/or NIfTI as our initial large-data format, since these formats appear to support large volumes. I would like to add support for CIVM native format (series of raw slice files), but I don't consider this a requirement for the first release.
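Analyze 7.5 and NIfTI-1 share a 348-byte header layout for the fields needed to locate a slice: sizeof_hdr at byte 0, dim[8] (16-bit) at byte 40, datatype and bitpix at bytes 70 and 72, and vox_offset (32-bit float) at byte 108. The sketch below parses only those fields and computes a slice offset. It is not a complete reader, the byte-order check via sizeof_hdr is one common heuristic, and the 2048 x 2048 x 1024 example volume is made up. Because dim entries are 16-bit, each axis is limited to 32767 voxels.

```python
import struct

HEADER_BYTES = 348


def parse_header(header):
    """Extract geometry from a 348-byte Analyze 7.5 / NIfTI-1 style header."""
    if len(header) < HEADER_BYTES:
        raise ValueError("header too short")
    for order in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(order + "i", header, 0)
        if sizeof_hdr == HEADER_BYTES:
            break
    else:
        raise ValueError("not an Analyze/NIfTI-1 header")
    dim = struct.unpack_from(order + "8h", header, 40)
    datatype, bitpix = struct.unpack_from(order + "2h", header, 70)
    (vox_offset,) = struct.unpack_from(order + "f", header, 108)
    return {
        "big_endian": order == ">",
        "shape": dim[1:1 + dim[0]],  # dim[0] holds the number of dimensions
        "datatype": datatype,
        "bitpix": bitpix,
        "vox_offset": int(vox_offset),
    }


def slice_offset(info, z):
    """Byte offset of native slice z within the image data file."""
    nx, ny = info["shape"][0], info["shape"][1]
    return info["vox_offset"] + z * nx * ny * (info["bitpix"] // 8)


# Demo: a synthetic big-endian header for a hypothetical 16-bit volume.
header = bytearray(HEADER_BYTES)
struct.pack_into(">i", header, 0, HEADER_BYTES)
struct.pack_into(">8h", header, 40, 3, 2048, 2048, 1024, 1, 1, 1, 1)
struct.pack_into(">2h", header, 70, 4, 16)  # datatype 4 = signed 16-bit
struct.pack_into(">f", header, 108, 352.0)
info = parse_header(bytes(header))
offset = slice_offset(info, 600)
```

For this example, slice 600 starts more than 4GB into the file, so any reader we adopt has to use 64-bit file offsets regardless of what the header format allows.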
For the first release, we may want to support only native-orientation viewing. Precomputed three-axis viewing could come in a subsequent release.
--
JeffBrandenburg - 14 Jan 2009