Similarity Searches of Medical Image Data in Peer-to-Peer Systems Amalia Charisi and Vasileios Megalooikonomou, Member, IEEE Abstract-The objective of this study is to effectively perform content-based medical image retrieval in distributed systems. We present a method that constructs a distributed index over a peer-to-peer network. Considering the network bandwidth limitations and other restrictions that are associated with the handling of medical data, we do not further distribute images between the participant peers in the network. We distribute only feature vectors, extracted from each image from which only a low resolution image can be obtained. The images are processed locally at each site. For the index distribution, we develop our own hash function that is based on multi-resolution analysis of the images using the wavelet transform and on a set of reference images that is known to each node in the network. To evaluate our method and demonstrate its applicability, we performed similarity searches on a brain image dataset. We also compared the performance of the distributed system to that of the centralized one. I. INTRODUCTION R ECENT technological advances in medical image acquisition techniques have facilitated archiving and analysis of 2D or 3D images in healthcare institutions worldwide. Scientists as well as clinicians are interested to learn from this data, in order to assist medical diagnosis. Moreover, the evolution in networking and telecommunication technologies in combination with the above progress has enabled physicians and researchers to share their knowledge, by transferring medical images and clinical data, giving birth to the emerging field of telemedicine. The deployment of telemedicine applications raises the quality of healthcare services by also promoting diagnostic support in remote centers where the transfer of patients may be difficult. Nevertheless, the complexity of implementing a telemedicine system due to communication constraints (i.e., bandwidth limitations), but also security and confidentiality issues introduces challenges both for engineers and medical experts. Here we consider the problem of efficiently performing similarity searches on distributed medical image data in order to facilitate diagnosis and prognosis. Clinical knowledge has shown that visual characteristics of medical images have strong effect on diagnosis and can help the Manuscript received June 15, 2010. This work was supported in part by a University of Patras "Karatheodori" grant. A. Charisi is with the Department of Computer Engineering and Informatics, University of Patras, Patras, Greece (phone: +302610997534; fax: +303610969018; e-mail: charisa@ ceid.upatras.gr). v. Megalooikonomou, is with the Department of Computer Engineering and Informatics, University of Patras, Greece and with the Data Engineering Laboratory, Center for Information Science and Technology, Temple University, USA (e-mail:
[email protected]). 978-1-4244-6561-3/101$26.00 ©2010 IEEE clinician to interpret the results. Several studies in the literature have considered the problem of similarity searching in distributed image databases [1], [1], [3]. In [1] a query is issued in every database that participates in the distributed system, whereas in [3] a central server receives the query and sends it to every node that stores images. In [1] a framework for an image search engine is proposed. In the field of telemedicine, research on analyzing medical images, aiming to support medical decision making, is in progress. The studies that have been proposed have expanded in several fields like classification [4], searching for Regions ofinterest (ROIs) [5], etc. These studies propose the use of distributed systems to take advantage of the processing resources that are offered by multiple sites. In this paper, we propose a method for constructing a distributed index in order to efficiently perform similarity searches of medical image data over a network of peers. The distributed index is composed of indices of individual images. Each image's index contains a feature vector, an IP address and a unique name. To determine the node that stores the image's index, we build a hash function that is based on a multi-resolution analysis of the images and on a set of reference images known by all nodes. For the multi resolution analysis, we use the wavelet decomposition that is very popular in medical imaging [6]. Taking into account bandwidth limitations and other restrictions that are associated with the handling of medical data, the images of the local databases are processed locally at each site that participates in the network; only their index is distributed. We assume that security and confidentiality issues of medical data are handled at each site following the procedures and protocols of each institution. By providing similarity searches capability on medical image data, the system can assist the clinician or medical doctor in making more informed decisions based on related clinical data and other physicians' evaluations on similar clinical cases. The proposed methodology is tested on the peer-to-peer system, Pastry [7]. II. METHODOLOGY The proposed methodology is based on the multi resolution analysis of the distributed images. It starts with a preprocessing step, where features are extracted locally from the distributed images using the wavelet decomposition. Then, the distribution of the feature vectors is performed, based on a hash function that uses the feature vectors extracted from the images at multiple resolution levels of the wavelet decomposition and a set of reference images. This process identifies the peer that stores the index of each particular image. Our goal is to store similar objects (feature vectors) to adjacent peers. A. Pre-processing step In the proposed framework, we use wavelet decomposition for feature extraction. The main reasons that led us to use wavelet decomposition are a) the fact that the decomposition can perform a multi-resolution analysis and b) that it can record local features in data. These characteristics of Wavelet transform make it very useful for our method, which requires the use of a multi-resolution approach for the index's distribution. The pre-processing step is performed locally at every node. This process includes the feature extraction from multiple resolution levels of the wavelet decomposition. The coefficients that are extracted are used to determine the node that stores the image's index. The image's index is composed of 1) a feature vector, from which only a low resolution image can be obtained, 2) the IP-address of the node that stores the original image and 3) a unique name, with which the image is stored in the local database. The global distributed index is composed of the indices of all images. For the wavelet decomposition in our experiments we use the Haar wavelet. B. Image's Index Distribution In structured peer-to-peer systems, every object has its own unique key with which it is stored in the network. The nodes that participate in the network have the same property, i.e., every node has its own nodeID which is uniformly produced by a hash function. An object is stored in the node that has nodeID that is the most similar to the object's key. (� fEl Fe.toreVect.r � �- 8 ,�".,,�,:: �'w i ' . � FeatureVector .. , le,·tI FeatureVector 1� .. 1.-2 FeatureVector Jenl- I ... �.M � Fig. I. Extracting the feature vectors and finding the most similar feature vectors of a set of reference images to the indexed image. The colored cells represent the most similar reference image feature vectors at each resolution level. In the proposed framework, we aim for each node to store similar objects after the construction of the distributed index is completed. To achieve this, we form our own hash function that creates the keys from the content of the images. The hash function is also based on a set of reference images. In our case the objects are not the images themselves (since these are kept at their original sites) but rather pointers to the images. For every image, that is stored in each peer's database, and for every resolution level the following steps are executed: 1) extract the image's feature vector (Fig. I), 2) compare the image's extracted feature vector with the feature vectors of the set of reference images, and find the most similar one (Fig. I), 3) include in the object's key the index of the most similar (with respect to corresponding feature vectors at each level) reference image. The key is a unique number for every object. The most significant bits of this key determine the node participating in the common distributed index that stores the image's index (Fig. 2). The less significant bits are provided by a random number generator to make each key unique (Fig. 2). This way, every stored object gets a unique id which is a requirement for Pastry. hash fimction's key � ------... I I I I I xxxx Rl R�'l R2 R�i_ l Rk random number Fig. 2. The resulting key that is produced by the proposed hash function for every indexed image. lRi corresponds to the most similar reference image of each of the 5 resolution levels, where i is the reference image's index in the set of reference images. C. Selection of the Set of Reference Images The set of reference images is either arbitrarily chosen or it is chosen by the domain experts, i.e., the clinical experts. An expert can select the representative subset taking into account the modality of the images or the particular region of interest that is represented in the image, considering also the retrieval tasks that the system will perform. A peer, that is responsible for this process, collects the subset of the images. The peer uses the wavelet decomposition at various resolution levels, to compute the images' feature vectors which are then distributed to the network. D. Query Process Our method performs similarity searches using the distributed index. When a query image is submitted, its feature vector for every resolution level is calculated using the wavelet decomposition. Then, using the set of reference vectors and the hash function, its key is calculated, and the query is routed to the node that has a nodeID that is the most similar to that key. The node at which the query arrives is denoted as destination node. The node that issues the query, in addition to the key, sends to the destination node the number of requested similar images. In case that, the destination node does not contain the total number of the requested items, it performs a progressive lookup to its neighbors, until it collects the number of the requested items. If the destination node has collected more objects than it is needed, it sorts the results and retums to the node that issued the query, the most similar objects to the query. The number of the requested objects is a parameter of our system. III. EXPERIMENTAL RESULTS In order to evaluate the proposed framework, we tested it on a medical image dataset that is obtained from an fMRI study designed to explore neuroanatomical correlates of semantic processing in Alzeheimer's Disease [8]. The dataset consists of 3D fMRI activation contrast maps. Two groups of subjects are included in this study: 9 controls and 9 patients. For each subject there are 68 20 slices with dimensionality of 79 by 95. In our similarity analysis experiments, we consider each one of the 20 slices as a separate image resulting in a total of 1224 different images. We conducted experiments using a number of parameters to evaluate the efficiency and the accuracy of our method. To quantify the accuracy of retrieval results, we used two measures of accuracy: A 1 !object from same class! ccu = *100% (1) K A 2 !object from same subject! (2) ccu = *100% K The first accuracy (Accul) defined in Equation (1) is the percentage of requested images that are from the same class (in our case, controls or patients), while the second accuracy (Accu2) defined in Equation (2) calculates the percentage of requested images that are from the same subject as the query image. Higher percentage of requested images from the same subject or from the same class should be considered as a better retrieval result. This is due to the correlation that exists among different slices (20 images) of the same subject (3� image) especially neighboring slices. Within the same class we expect a more homogeneous set of images than across classes. In our experiments we used as parameters the number N of nodes that participate in the network, the number K of requested images and the number L of decomposition levels used. In order to investigate the accuracy of our method, we considered each of the 1224 slices as a query and returned its top K matches of similar images. For the implementation of our system, we used as simulator the Pastry peer-to-peer system. Our methodology was built and tested on top of the infrastructure provided by Pastry. The set of reference images in the experiments was chosen arbitrarily. The number of reference images, rv, used was 252 unless specified otherwise. This relatively small value was chosen empirically and gives the best retrieval results in the experiments we conducted with the given dataset. For a fair comparison, in the implementation of the centralized system we used the Haar wavelet decomposition, and in particular, the coefficients of every resolution level (of the decomposition) to construct the feature vectors. To approximate the function of the distributed system, where every resolution level has its own significance in the construction of the object's key (see Section III. B), in the centralized system we used different weights for every resolution level in order to find the most similar objects. The weights were selected, based on experiments we performed, in order to maximize the efficiency of the centralized system and were equal to: 0.8 for level 6, 0.1 for level 5, 0.04 for level 4, 0.03 for level 3, 0.02 for level 2 and 0.01 for level I. When experimenting with 4 decomposition levels (levels 6, 5, 4, and 3) the same weights were used for each of the levels. When employing only one decomposition level, we did not use any weight. We also measured how the accuracies are affected by the number of decomposition levels. We used different numbers of levels L=I, 4 and 6. For the computation of the feature vector's key, we used as the most significant level, the coarser one, i.e., level 6. The results were obtained from experiments using various numbers of participant peers and requested images, but the same set of reference vectors. Figure 3 illustrates the retrieval accuracy in terms of the percentage of requested images that are from the same class. Figure 4 shows the retrieval accuracy in terms of the percentage of the requested images that are from the same subject as the query. As we can see, the use of more decomposition levels results in slightly better accuracy. The fact that there is not a great difference between the resulting accuracies is because the number of indices that are distributed over the network of peers increases with the number of decomposition levels. When we use only one decomposition level the indices are distributed within a very small subset with respect to the number of participant peers. As a result all the indices of the images are stored in those nodes. The variability of the accuracy for various settings of the decomposition levels is more obvious in Figure 4. 0.9 __ O . B 0.7 0.6 0.5 0.4 0.3 0.2 0.1 o . K =40 Fig. 3. Retrieval accuracy (Accul) for different numbers of requested images (K=20, 40) and number of participant peers (N=IOO, 200, 400) accordingly (a) L=I and the chosen decomposition level is the coarser one (level 6), (b) L=4 and the most significant level is the coarser one (levels 6, 5, 4, 3), (c) L=6 and the most significant level is the coarser one (level 6) The accuracy also increases when the number of participant nodes increases. This is happening because the index becomes more "discrete"; i.e., the same number of images is handled by more peers, and thus, each peer stores images (or more accurately, images' indices) that are more similar to each other, compared with the case that the same number of images is handled by a fewer number of peers. Finally, as the number of requested images rises, the retrieval accuracy decreases. The same observations hold for the centralized system. As expected, the centralized system has better performance in terms of retrieval accuracy compared to the distributed system (Fig. 5). This is happening, because it does not make similarity decisions based on a limited set of reference images but rather using all the available data. 0. 7 .r------------------ 0.6 .v---;;-._---=--=-___ ----=--;;-_.- 0. 5 �-II-I__I:_-_IHI-...... r__II_ __ -1-_a;_ 0.4 -I-IIrl _____ -_IH __ .-III-_I .. ______ t- 0.3 �HI-___ -_IH __ .-III-_I .. ______ t- 0.2 J-II1-I _____ --1� ______ I---1 _________ t_ 0 . 1 J-II1-I _____ --1� ______ I---1 __ .-_____ t_ O ���������������� • K= 20 . K= 40 Fig. 4. Retrieval accuracy (Accu2) for different numbers of requested images (K=20,40) and number of participant peers (N=IOO, 200, 400) accordingly (a) L=I and the chosen decomposition level is the coarser one (level 6), (b) L=4 and the most significant level is the coarser one (levels 6,5,4,3), (c) L=6 and the most significant level is the coarser one (level 6). Din. rv=72 Dist.rv;126 rv:;;252 centralised (a) (b) (c) (d) • Oist. N=100 • Oist. N=200 • Oist. N=400 Fig. 5. Retrieval accuracy (Accu l ) for L=6 (levels 6,5,4,3,2,1), K=20 and N=IOO, 200, 400 accordingly (a) set of reference images rv=72 , (b) set of reference images rv=126 ,(c) set of reference images rv=252, (d) centralized system. To show the effectiveness of our system, we measure the number of messages that need to be sent when we issue a query. In the number of exchanged messages we include the messages to the neighbors and the messages needed until the query gets to its destination node. Table 1 illustrates the results. As we can see when the number of participant peers increases, the number of exchanged messages increases too. In general, only a small number of exchanged messages is sent, considering the fact that the destination node asks from its neighbors their stored indices during the query process. TABLE I AVERAGE NUMBER OF EXCHANGED MESSAGES AND TRANSFERRED BYTES Average Number of Messages and Bytes K=20 K=40 N=100 1.84 64552 2.5\ 124602 N=200 4.\9 41554 4.54 96434 N=400 5.36 30147 5.98 57164 IV. CONCLUSION In this paper, we presented a method for similarity searches of medical image data in peer-to-peer systems. We constructed a distributed index based on a low resolution feature vector of the image, the IP-address of the source node and the image's Id-name. Our system does not require the transfer of images among peers for the processing of the similarity searches but just for delivering the full result to the requesting node. Considering the network bandwidth limitations and other restrictions that are associated with the handling of medical data, it does not further distribute images between peers. Our goal in this work was to have similar objects end up in the same or adjacent nodes of the peer-to-peer network. For the identification of the node that stores the index, we designed a hash function that was based on the multi-resolution analysis of the distributed images. We compared the proposed p2p system with a centralized system that uses all the images in order to make similarity decisions. Furthermore, we evaluate the effectiveness of our method by measuring the number of exchanged messages during the query process. Experimental results illustrated that the proposed distributed system performs similarly to the centralized one for certain experimental settings. Concluding, the main contribution of the proposed methodology is that it can combine medical images from different sites in a common distributed index, allowing scientists or physicians to easily access other clinical data, assisting in effective medical decision making. ACKNOWLEDGMENT We would like to thank Prof. Peter Triantafillou and MSc student Andreas Loupasakis for providing information about the Pastry. REFERENCES [I] K. Jarrah and L. Guan, "Content-based image retrieval via distributed databases," in Proc. of the 7th Int. Coif. on Content-Based Image and Video Retrieval, New York, 2008, pp. 389-394. [2] A Vlachou, C. Doulkeridis, D. Mavroeidis, and M. Vazirgiannis, "Designing a Peer-to-Peer Architecture for Distributed Image Retrieval", in Proc. of Adaptive Multimedia Retrieval, Paris, 2007, pp. 182-195. [3] R. Maree, P. Denis, L. Wehenkel, and P. Geurtis, "Incremental Indexing and Distributed Image Search using Randomized Vocabularies", in Proc. of the Coif. on Multimedia Information Retrieval, Philadelphia, 2010, pp. 91-100. [4] P. Luo, K. Ln, Z. Shi, and Q. He, "Distributed data mining in grid computing environments", Future Generation Computer Systems, vol. 23, no. 1, pp. 84-91, Jan. 2007. [5] V. Megalooikonomou and D. Kontos "Medical data fusion for telemedicine. A model for distributed analysis of medical image data across clinical information repositories", IEEE Engineering Medicine and Biology Magazine, vol. 26, no. 5, pp.36-42, Sep. 2007. [6] M. Unser, A Aldroubi, and A Laine, "Guest Editorial: Wavelets in Medical Imaging", IEEE Trans. on Medical Imaging, vo1.22, no. 3, pp. 285-288, May 2003. [7] A Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems", in Proc. of the Int. Con! on Distributed Systems Platforms, Heidelberg, 2001, pp. 329-350. [8] AJ. Saykin, L.A. Flashman, S.A. Frutiger, S.c. Johnson, AC. Mamourian, C. H. Moritz, JR. 0' Jile, HJ. Riordan, R.B. Santulli, C.A Smith, and 1.B. Weaver, "Neuroanatomic substrates of semantic memory impairment in Alzheimer's disease: Patterns of functional MRI activation", Journal of the International Neuropsychological SOCiety, vol. 5, pp. 377-392, 1999.