Owing to the popularity of digital cameras and camcorders, we have witnessed a dramatic increase in visual content such as photos and videos in recent years. The increased use of visual content has fostered new multimedia applications. For instance, flickr.com, a photo-sharing web site, allows registered users to share their photos and tag them with keywords for categorization and browsing. Photo blogging is another popular application that enables users to publish their photos and add annotations (Figure 1b).
In the video domain, streaming is becoming increasingly pervasive. Many major online media websites, such as CNN.com, have established free news video services (Figure 1c). The emergence of these applications calls for automatic ways of managing vast amounts of visual content. In contrast to the significant advancement of multimedia applications, technologies for automatic visual content analysis and indexing are still lacking. To this day, it remains difficult for a computer to understand the content of a natural image.
Understanding the visual content of an image is considered so difficult that people have coined the phrase “semantic gap” to describe the difficulty of translating low-level features into high-level semantics. Most traditional techniques for analyzing and indexing content are based on classic pattern recognition theory. This standard paradigm first represents an image or a video segment as a feature vector, and then applies a classifier to map the features (continuous or discrete variables) to semantic labels (symbolic variables).
Behind this paradigm is a rigorous mathematical framework for automatic learning and decision making. Research on automatic visual content analysis has followed this framework for many years. Advances in machine learning and pattern recognition have produced sophisticated statistical and learning algorithms. Indeed, the performance of image retrieval and semantic concept detection has improved significantly thanks to machine learning algorithms such as Support Vector Machines (SVM), Boosting and Ensemble learning.
There is no doubt that statistical and machine learning algorithms will continue to play an indispensable role in visual content analysis systems. Yet, despite the pervasive use of statistical methods with sophisticated optimization algorithms, the performance of visual content analysis and indexing by computers remains far behind that of humans. This discrepancy raises fundamental psychological and philosophical questions beyond the technical design of recognition algorithms: how do human beings perceive images?
For decades, human vision researchers have been studying how human beings perceive and understand images. Psychological studies indicate that human image understanding differs dramatically from the standard paradigm described above. It turns out that human image perception is goal-driven, active and partly sequential. The process can be roughly divided into two subsystems: a pre-attentive subsystem and an attentive subsystem. In the pre-attentive subsystem, the human eye tries to detect salient parts of the image, such as corners.
In the attentive subsystem, these detected parts are grouped into meaningful high-level entities such as objects. Feedback from the attentive subsystem can in turn help the pre-attentive subsystem locate the salient parts. Such a mechanism motivates similar approaches in computer vision and image retrieval, namely, decomposing images or objects into a collection of parts (visual primitives such as regions or corners) and exploiting part attributes and relations for analysis and retrieval.
Image analysis by parts is not a new idea; it has been adopted by computer vision researchers from the field's very beginning. For example, the generalized cylinder is a part-based model proposed by Marr for describing three-dimensional objects. Marr proposed that a three-dimensional object can be composed of a set of generalized cylinders, whose parameters characterize the individual parts. Similar representations for two-dimensional objects have also been developed. For instance, a well-known face recognition model called the Elastic Bunch Graph represents a face as a set of parts with spatial relations.
Despite the success of these methods in constrained domains, such as synthetic images or photos taken under controlled laboratory conditions, it is difficult to apply them to real-world images or videos. The main problem is that irrelevant content and noise are ubiquitous in real-world images. As a result, it is difficult to guarantee that the parts relevant to the target object are accurately detected and that the extracted features describe the detected parts precisely. The Elastic Bunch Graph model, for example, generally requires that the vertices of the graph be well registered to the corner points in the face.
This may be achievable for face images captured under good conditions, but it is rather difficult when faces are embedded in complex scenes or partly covered by a mustache or glasses. On the other hand, if we look back at the methods based on pattern classification, we find that they often achieve satisfactory results on real-world images and videos even though their representations may not be optimal. One example is the SVM-based face detection system. The system exhaustively slides a search window over the image plane, extracts a feature vector within each window and feeds it into an SVM for classification.
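The exhaustive sliding-window search can be illustrated with a minimal sketch. It is not the cited face detection system: the function names, the raw-pixel features and the linear scoring function `toy_linear_score` (a stand-in for a trained SVM decision function) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def sliding_window_detect(
    image: Image,
    window: int,
    stride: int,
    score_fn: Callable[[List[float]], float],
    threshold: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for every window whose score exceeds threshold."""
    height, width = len(image), len(image[0])
    detections = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            # Flatten the window into a feature vector (raw pixels here).
            features = [
                image[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            score = score_fn(features)
            if score > threshold:
                detections.append((top, left, score))
    return detections


def toy_linear_score(features: List[float]) -> float:
    """Hypothetical stand-in for a trained SVM decision function w.x + b."""
    weight, bias = 1.0 / len(features), -0.5
    return weight * sum(features) + bias


# A 6x6 dark image with one bright 2x2 blob at rows 2-3, columns 3-4.
toy_image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        toy_image[r][c] = 1.0

hits = sliding_window_detect(toy_image, window=2, stride=1, score_fn=toy_linear_score)
```

Every window position is scored independently, which is why the cost grows with the number of positions (and, in a real detector, with the number of scales searched).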
Although the computational cost of the system is extraordinarily high, its performance has been shown to be satisfactory for face detection in natural images. The main reason is that the SVM model is able to learn face variations from the training data, which makes it robust to varying conditions. Based on the above analysis, a natural question is whether we can establish a similar statistical framework for part-based models, that is, whether we can extend pattern classification and machine learning methods to part-based representations.
Compared with traditional feature vector based approaches, pattern classification and machine learning on part-based representations, or non-vector space models, is a much less explored area. Currently, there is no formal theoretical framework for non-vector space models. Therefore, combining part-based models with statistical methods is not trivial. This non-triviality draws a clear boundary between the traditional deterministic part-based models and the new ones with statistical modeling. We call the former classic part-based models and the latter statistical part-based models.
To make the following chapters easier to follow, we first present a brief overview of part-based modeling in the following section.
1.2 Introduction to Part-based Modeling
The part-based modeling procedure usually contains three major components: part detection, feature extraction and part-based representation (Figure 1.2). Part detection locates the salient visual primitives in an image, such as corners or local image patches. Feature extraction extracts informative attributes that describe individual parts and inter-part relationships.
Part-based representation organizes the detected parts and their attributes into an integrated entity that serves as the input to the part-based analysis subsystem, which could perform similarity computation, object detection, etc.
1.2.1 Part Detection in Images
Parts are the elementary building blocks for part-based analysis. Here, parts mean visual primitives in an image. Parts can be corners found by a corner detection algorithm, regions obtained by color region segmentation, etc. The choice of a specific part detector depends on the characteristics of the application.
For example, for image similarity, our experiments have shown that a corner point based representation performs better than a region-based representation. Corner detection is also known as interest point detection. Although there are several algorithms for corner detection, the principles behind them are very similar. The most widely used corner detection algorithm is probably the Harris corner detector, which detects corners by analyzing the second-moment matrix of the image gradients in a local image patch.
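As an illustration of the Harris detector, the sketch below computes the usual corner response R = det(M) - k * trace(M)^2, where M is the second-moment matrix of the image gradients accumulated over a local window. It is a simplified version under stated assumptions: a uniform (box) window replaces the Gaussian weighting that is commonly used, k = 0.04 is a conventional choice, and the helper names are ours.

```python
import numpy as np


def box_sum(values: np.ndarray, radius: int) -> np.ndarray:
    """Sum each pixel's (2*radius+1)^2 neighborhood, zero-padded at the border."""
    padded = np.pad(values, radius, mode="constant")
    out = np.zeros_like(values, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + values.shape[0], dx:dx + values.shape[1]]
    return out


def harris_response(image: np.ndarray, radius: int = 1, k: float = 0.04) -> np.ndarray:
    """Harris corner response R = det(M) - k * trace(M)^2 at every pixel."""
    gy, gx = np.gradient(image.astype(float))  # first-order derivatives
    # Entries of the second-moment matrix M, accumulated over a local window.
    sxx = box_sum(gx * gx, radius)
    syy = box_sum(gy * gy, radius)
    sxy = box_sum(gx * gy, radius)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2


# Synthetic test image: a bright square on a dark background.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
response = harris_response(img)

corner = response[5, 5]    # a corner of the square
edge = response[5, 10]     # middle of the top edge
flat = response[10, 10]    # interior of the square
```

On this synthetic image the response is positive at the corner, negative along the edge and zero in the flat interior, so thresholding R and keeping local maxima yields corner points.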
Another popular corner detection algorithm is the SUSAN corner detector, which detects corners by comparing the brightness of the pixels in a circular mask with that of the center pixel, without computing image derivatives. Region segmentation is intended to partition an image into a set of disjoint regions with homogeneous properties. It has been an active research topic in computer vision. Major algorithms include K-means clustering based approaches, Graph Cuts and the mean-shift algorithm. Compared with interest point detection, the main limitation of region segmentation is its sensitivity to noise.
Because of noise, segmenting two images with the same visual content may yield totally different segmentations. Furthermore, because the segmented regions are disjoint, segmentation errors often cannot be corrected in the later analysis stage. Another type of region detection algorithm, called the salient region detector, yields overlapping regions rather than disjoint regions. The benefit of using overlapping regions is that they create an overcomplete representation of an image, so that high-level knowledge can be used in the analysis stage to eliminate irrelevant or noisy regions.
A popular salient region detector used in object detection is the maximum entropy region (MER) detector. The MER detector scans over the image. At each location (uniformly sampled on the image plane, or at each pixel), it initializes a circle and increases the radius of the circle until the entropy of the content within the region reaches a maximum. Previous research has shown that the MER detector performs very well in object detection. Examples of part detection include corner detection, region segmentation and salient region detection.
1.2.2 Feature Extraction
After part detection, feature extraction is applied to extract the attributes of the parts. Features can be spatial features, such as the coordinates of the corner points or the centroids of the regions; color features, such as color histograms within the regions; or texture features, such as Gabor wavelet coefficients or steerable filter coefficients. More recently, Lowe proposed a local descriptor extracted by the Scale Invariant Feature Transform (SIFT), which has been shown to be one of the most robust features under common image transformations.
1.2.3 Part-based Representation
Part-based representation groups the detected parts and their features into an integrated entity that serves as the input to the analysis system. It represents an image as a collection of individual elements. The relationships among these elements may or may not be modeled. If the relationships among parts are not modeled, the representation is a bag-based representation. The bag-based representation has been widely used in information retrieval and computer vision.
For example, a bag-of-pixels model has been proposed for image representation, and bag-of-keypoints has been used in object detection. If the relations among the parts are modeled, the representation is graph-based. There are different types of graph-based representations. The simplest is the structural graph model, in which the attributes of parts are not taken into account. Such a representation can be used for modeling aerial images, maps and trademarks. However, the structural graph model has limited representational power for general image data, such as natural images.
The Attributed Graph (AG), or Attributed Relational Graph (ARG), is an augmented graph model in which vertices and edges are associated with discrete or real-valued feature vectors that represent the properties of parts and their relations. Therefore, an ARG can be used to represent general image data. There are different types of ARGs. If the attributes are integers or symbols, the model is called a labeled graph. If the graph contains no cycles, it becomes an attributed tree. Labeled graphs are important models in research fields such as structural chemistry and structural biology.
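To make the ARG concrete, the following sketch stores a feature vector on every vertex (part) and on every edge (relation). The class, its method names and the choice of attributes (corner coordinates on vertices, Euclidean distance on edges) are illustrative assumptions, not the representation used in later chapters.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AttributedRelationalGraph:
    """Vertices are parts; edges are part relations; both carry feature vectors."""
    vertex_attrs: Dict[int, List[float]] = field(default_factory=dict)
    edge_attrs: Dict[Tuple[int, int], List[float]] = field(default_factory=dict)

    def add_part(self, part_id: int, attrs: List[float]) -> None:
        self.vertex_attrs[part_id] = list(attrs)

    def relate(self, a: int, b: int, attrs: List[float]) -> None:
        if a not in self.vertex_attrs or b not in self.vertex_attrs:
            raise KeyError("both parts must be added before relating them")
        # Undirected edge: store the vertex pair in sorted order.
        self.edge_attrs[(min(a, b), max(a, b))] = list(attrs)

    def as_bag(self) -> List[List[float]]:
        """Drop the relations to obtain a bag-based representation."""
        return list(self.vertex_attrs.values())


# Three corner points as parts, with pairwise distances as relation attributes.
arg = AttributedRelationalGraph()
points = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (6.0, 0.0)}
for pid, (x, y) in points.items():
    arg.add_part(pid, [x, y])
for a, b in [(0, 1), (1, 2), (2, 0)]:
    (xa, ya), (xb, yb) = points[a], points[b]
    arg.relate(a, b, [math.hypot(xb - xa, yb - ya)])
```

Discarding `edge_attrs`, as `as_bag` does, recovers the bag-based representation, while discarding the attribute vectors and keeping only the vertex and edge sets gives the structural graph model.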
Compared with general Attributed Relational Graphs, labeled graphs also have many special mathematical properties. They are therefore also studied in combinatorics, for example in analytic combinatorics. For image applications, labeled graphs are less useful because of their limited representational power. Figure 1.4 illustrates several examples of part-based representation. There are two prominent distinctions between part-based representation and vector-based representation. First, in vector-based representation, every feature vector has the same number of components, i.e. the same dimension.
Second, the order of those components is fixed in vector-based representation. Part-based representation is more general than vector-based representation: the number of parts can vary across different data, and there is no order associated with individual parts. These two distinctions give rise to the problems of finding correspondences between parts and dealing with removed or inserted parts.
1.3 Required Contributions
There is a need to exploit statistical modeling and learning for part-based methods in order to solve several problems in visual content analysis and indexing.
First, the learnable similarity of images in the context of part-based models is a very important and exploitable property. Second, we can establish formal statistical models that describe the topological and attributive properties of a class of objects for object detection. Third, we study the problem of assigning labels to segmented regions in images through higher-order relation modeling. We elaborate on each of these problems below and present a brief overview of past work and our main contributions. In each of the subsequent chapters, we provide a more detailed review of the prior work.
1.3.1 Image Similarity
Measuring image similarity is a problem that arises in many multimedia applications, such as content-based image retrieval and visual data mining. Image similarity itself can be formulated in a straightforward way: given two images I1 and I2, output a real-valued number S(I1, I2) that measures their similarity. In traditional content-based retrieval systems, similarity is simply calculated from the feature vectors that represent the images, for example as the inner product of two color histograms. Such a similarity definition is simple, easy to compute and often has good mathematical properties that favor efficient algorithm design.
Yet this definition is far from optimal. For instance, two images with exactly identical color histograms could have totally different content. We suggest that a good similarity measure should be consistent with human judgment; that is, our proposed measure for image similarity should be motivated by the Human Vision System. Regarding the learning of image similarity, there has been much work on similarity learning with vector-based representations. It is difficult for these approaches to achieve good similarity measures because of the limitations of the feature vector based representation mentioned above.
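The histogram-based similarity and its blind spot can be sketched as follows. For brevity the example uses gray-level rather than color histograms, and the function names and the two toy images are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def gray_histogram(image: Sequence[Sequence[int]], levels: int) -> List[float]:
    """Normalized histogram of pixel values in range(levels)."""
    counts = Counter(value for row in image for value in row)
    total = sum(counts.values())
    return [counts.get(level, 0) / total for level in range(levels)]


def histogram_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Inner product of two histograms, as in the vector-based definition."""
    return sum(a * b for a, b in zip(h1, h2))


# Two 2x4 images with the same pixel values in different spatial layouts.
left_right = [[0, 0, 1, 1],
              [0, 0, 1, 1]]      # dark half on the left, bright half on the right
checkerboard = [[0, 1, 0, 1],
                [1, 0, 1, 0]]    # alternating pattern

h_a = gray_histogram(left_right, levels=2)
h_b = gray_histogram(checkerboard, levels=2)
same_histogram = h_a == h_b
self_similarity = histogram_similarity(h_a, h_a)
cross_similarity = histogram_similarity(h_a, h_b)
```

Because the histogram discards all spatial layout, the two visibly different images receive exactly the same score as an image compared with itself.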
In comparison, part-based representation provides more comprehensive information about the visual scene in an image. It can not only capture the appearance features of an image but also characterize the relationships among parts. In this thesis, we establish a learnable similarity framework for part-based representation. We focus on the most general representation, i.e. the Attributed Relational Graph (ARG). The similarity is defined as the probability ratio (a.k.a. odds) of whether or not one ARG is transformed from the other.
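One way to write this definition down is shown below. The symbols are our own shorthand for this overview and may differ from the notation adopted in later chapters.

```latex
% Illustrative notation: G_s and G_t are two ARGs; H = 1 is the hypothesis
% that G_t is transformed from G_s, and H = 0 the hypothesis that it is not.
\[
S(G_s, G_t) \;=\; \frac{P(H = 1 \mid G_s, G_t)}{P(H = 0 \mid G_s, G_t)}
           \;=\; \frac{P(H = 1 \mid G_s, G_t)}{1 - P(H = 1 \mid G_s, G_t)}
\]
```

A value above 1 means the transformation hypothesis is the more probable one, so the ratio can be thresholded or used directly for ranking.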
This definition is partly inspired by the relevance model used in information retrieval, where document retrieval is realized by calculating the probability ratio of whether or not the query is relevant to the document. Transformation-based similarity definitions are not new, but most of the prior work is based on computing the cost of transformation. Typical examples include string or graph edit distance, Earth Mover's Distance (EMD), etc. It remains unclear how to establish formal learning methods on top of these deterministic frameworks. On the other hand, probabilistic methods for matching the vertices of ARGs are also not new.
Bayesian methods have been proposed and extended for matching structural or labeled graphs in previous papers. However, although the Bayesian method is formal and principled for matching vertices, how to define the similarity and aggregate the vertex-level similarity to the graph level remains an open problem. Moreover, without a definition of graph-level similarity, parameter learning often has to resort to vertex-level annotations (vertex matching between two ARGs). Annotating vertex correspondences is very time-consuming, since an ARG representing an image typically has 50 to 100 vertices.
One of the main contributions of this thesis is the development of a principled similarity measure for ARGs. We show how this similarity measure relates to the partition functions of the Markov Random Fields (MRFs) that are used for matching vertices. The log-convexity of the partition functions of MRFs leads to dual representations (a.k.a. variational representations) of the partition functions, which are linear and tractable. The dual representation allows us to develop a maximum likelihood estimation method for learning parameters. The developed approach has been successfully applied to the detection of near-duplicate images in image databases.
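For intuition about the dual representation, the standard conjugate-duality identity for an exponential-family MRF is given below. This is the textbook form, stated under the assumptions listed in the comments, and it is not necessarily the exact formulation developed in later chapters.

```latex
% Assumed exponential-family form: p(x; \theta) = \exp(\langle \theta, \phi(x) \rangle) / Z(\theta).
% \mathcal{M} is the set of realizable mean parameters \mu = E[\phi(x)], and
% H(\mu) is the entropy of the distribution in the family with mean parameters \mu.
\[
\log Z(\theta) \;=\; \sup_{\mu \in \mathcal{M}} \bigl\{ \langle \theta, \mu \rangle + H(\mu) \bigr\}
\quad\Longrightarrow\quad
\log Z(\theta) \;\ge\; \langle \theta, \mu \rangle + H(\mu)
\quad \text{for any fixed } \mu \in \mathcal{M}.
\]
```

For a fixed \mu the bound is linear in the parameters \theta, which is what makes parameter learning by maximum likelihood tractable in this kind of formulation.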
Image Near-Duplicate (IND) refers to a pair of images that look similar but differ in content, such as object insertion and deletion, or in imaging conditions, such as a change of lighting or a slight change of viewpoint. Near-duplicate images are abundant among Internet images and broadcast news videos. Detecting INDs is useful for identifying copyright violations and for visual data mining. We compare the learning-based method with the traditional energy minimization based method for ARG similarity computation. The experimental results demonstrate that the learning-based method performs far better than the traditional methods.