Clustering Techniques in OODBMS (Using ObjectStore)
Introduction
Performance of a database can be greatly affected by the manner in which data is loaded. This is true regardless of when the data is loaded: before the application(s) begin accessing the data, or concurrently while the application(s) are accessing it. This paper presents various strategies for locating data as it is loaded into the database and details the performance implications of those strategies.
Data Clustering, Working Sets, and Performance
With ObjectStore, access to persistent data can perform at in-memory speeds.
In order to achieve in-memory speeds, one needs cache affinity. Cache affinity is the generic term that describes the degree to which data accessed within a program overlaps with data already retrieved on behalf of a previous request. Effective data clustering allows for better, if not optimal, cache affinity. Data density is defined as the proportion of objects within a given storage block that are accessed by a client during some scope of activation. Clustering is a technique to achieve high data density. The working set is defined as the set of database pages a client needs at a given time.
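The two definitions above can be made concrete with a small sketch. It is not ObjectStore code: the `Layout` mapping, the object names, and the two sample layouts are hypothetical, chosen only to show how data density and the working set are computed from a page layout and a set of accessed objects.

```python
from typing import Dict, Set

# Hypothetical layout: page id -> names of the objects stored on that page.
Layout = Dict[int, Set[str]]

def data_density(page_objects: Set[str], accessed: Set[str]) -> float:
    """Fraction of the objects on one page that the client actually accessed."""
    if not page_objects:
        return 0.0
    return len(page_objects & accessed) / len(page_objects)

def working_set(layout: Layout, accessed: Set[str]) -> Set[int]:
    """Pages that hold at least one accessed object."""
    return {page for page, objs in layout.items() if objs & accessed}

# Two illustrative layouts of the same nine objects, three per page.
scattered: Layout = {0: {"A", "x1", "x2"}, 1: {"B", "x3", "x4"}, 2: {"C", "x5", "x6"}}
clustered: Layout = {0: {"A", "B", "C"}, 1: {"x1", "x2", "x3"}, 2: {"x4", "x5", "x6"}}
needed = {"A", "B", "C"}

for name, layout in (("scattered", scattered), ("clustered", clustered)):
    pages = working_set(layout, needed)
    density = sum(data_density(layout[p], needed) for p in pages) / len(pages)
    print(f"{name}: working set = {len(pages)} page(s), mean density = {density:.2f}")
```

In the scattered layout each page in the working set is only one-third useful, while the clustered layout needs a single, fully used page.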
ObjectStore is a page-based architecture which performs best when the following goals are met:
• Minimize the number of pages transferred between the client and server
• Maximize the use of pages already in the cache
In order to achieve these goals, the working set of the application should be optimal. The way to achieve an optimal working set is via data clustering. With good data clustering, more data can be accessed in fewer pages, and thus a high data density is obtained. A higher data density results in a smaller working set as well as a better chance of cache affinity. A smaller working set results in fewer page transfers.
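The link between working set size and page transfers can be illustrated with a toy client cache. This is a sketch under stated assumptions: the least-recently-used eviction policy, the `count_page_transfers` helper, and the layouts are all hypothetical and are not a description of how the ObjectStore cache actually behaves.

```python
from collections import OrderedDict
from typing import Dict, Iterable

def count_page_transfers(accesses: Iterable[str],
                         page_of: Dict[str, int],
                         cache_pages: int) -> int:
    """Count pages fetched from the server while replaying object accesses
    through a client cache of `cache_pages` pages with LRU eviction (assumed)."""
    cache = OrderedDict()
    transfers = 0
    for obj in accesses:
        page = page_of[obj]
        if page in cache:
            cache.move_to_end(page)        # cache hit: mark as most recently used
            continue
        transfers += 1                     # cache miss: page comes from the server
        cache[page] = None
        if len(cache) > cache_pages:
            cache.popitem(last=False)      # evict the least recently used page
    return transfers

# The same access trace replayed against two hypothetical layouts.
trace = ["A", "B", "C"] * 3
clustered = {"A": 0, "B": 0, "C": 0}       # all three objects share one page
scattered = {"A": 0, "B": 1, "C": 2}       # one object per page

print(count_page_transfers(trace, clustered, cache_pages=2))
print(count_page_transfers(trace, scattered, cache_pages=2))
```

With a two-page cache, the clustered layout fetches its single page once, whereas the scattered layout has a three-page working set that does not fit in the cache and causes a transfer on every access.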
The following sections of this paper explain several clustering patterns and techniques for achieving better performance via cache affinity, higher data density and a smaller working set.
NOTE: In this paper, clustering is used as a concept of locality of reference. The term does not refer to the physical storage unit available in ObjectStore. ObjectStore does present the user with a choice for the location of allocations: within the database, within a particular segment, or within a particular cluster. For the remainder of this paper, the discussion of clusters is a conceptual one, not the ObjectStore physical one.
Database Design Process
Database design is one of the most important steps in designing and implementing an ObjectStore application. The following steps are prerequisites for a database design:
1) Identify key use cases (ones which need to be fast and/or are run frequently)
2) Identify the object(s) used by the use cases called out in step 1
3) Identify the object(s) that are read or updated during the use cases called out in step 1
The focus of clustering efforts should be on the database objects which are used in the high priority use cases identified above.
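One way to turn these steps into a short list of clustering candidates is sketched below. The `UseCase` record, the `runs_per_day` weighting, and the sample use cases are hypothetical; the paper does not prescribe a particular ranking method, so treat this as one possible reading of the steps.

```python
from collections import Counter
from typing import List, NamedTuple, Set

class UseCase(NamedTuple):
    name: str
    runs_per_day: int          # hypothetical measure of how often it runs
    objects: Set[str]          # object types read or updated by the use case

def clustering_candidates(use_cases: List[UseCase], top_n: int) -> List[str]:
    """Rank object types by how often the key use cases touch them."""
    weight: Counter = Counter()
    for uc in use_cases:
        for obj in uc.objects:
            weight[obj] += uc.runs_per_day
    # Sort by descending weight, then by name so that ties are deterministic.
    ranked = sorted(weight.items(), key=lambda item: (-item[1], item[0]))
    return [obj for obj, _ in ranked[:top_n]]

use_cases = [
    UseCase("place_order", 5000, {"Customer", "Order", "LineItem"}),
    UseCase("track_shipment", 2000, {"Order", "Shipment"}),
    UseCase("monthly_report", 1, {"Customer", "Invoice"}),
]
print(clustering_candidates(use_cases, top_n=3))
```

Here the object types touched by the frequent use cases rise to the top, which is where clustering effort would be spent first.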
Begin to cluster based on one use case, and then validate with the others. The database design strategies which lend themselves to achieving the optimal working set are:
• Clustering
• Partitioning
There are several techniques which result in data being well clustered:
• Isolate Index
• Pooling
• Object Modeling
Data Clustering
Clustering is a technique used to achieve high data density. Put another way, clustering is the grouping of objects together. If a use case requires objects A, B and C to operate, then those objects should be co-located for optimal data density.
If upon loading the database, those objects are physically allocated close to one another, then we say we have clustered those objects. Assume that the size of the three objects combined is less than the size of a physical database page. The clustering leads to high data density because when we fetch the page with object A, we will also get objects B and C. In this particular case, we need just one page transfer to get all objects required for our use case. To accomplish good clustering, one must know the use cases and the objects involved in those use cases.
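The effect of load order on the A, B, C example can be sketched as follows. The sketch assumes a simplified allocator that fills fixed-size pages sequentially in load order; the 4096-byte page size, the object sizes, and the helper names are illustrative assumptions, not ObjectStore behavior.

```python
from typing import Dict, List, Set, Tuple

PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only

def pack_in_load_order(objects: List[Tuple[str, int]],
                       page_size: int = PAGE_SIZE) -> Dict[str, int]:
    """Assign each (name, size) to a page, filling pages sequentially in load order."""
    page_of: Dict[str, int] = {}
    page, used = 0, 0
    for name, size in objects:
        if size > page_size:
            raise ValueError(f"{name} does not fit in one page")
        if used + size > page_size:    # object would not fit: start a new page
            page, used = page + 1, 0
        page_of[name] = page
        used += size
    return page_of

def pages_for(use_case: Set[str], page_of: Dict[str, int]) -> int:
    """Number of distinct pages needed to read every object in the use case."""
    return len({page_of[name] for name in use_case})

use_case = {"A", "B", "C"}                      # 1000 bytes each, 3000 in total
filler = [(f"x{i}", 1500) for i in range(4)]    # unrelated objects

together = [("A", 1000), ("B", 1000), ("C", 1000)] + filler
interleaved = [("A", 1000), filler[0], filler[1],
               ("B", 1000), filler[2], filler[3], ("C", 1000)]

print(pages_for(use_case, pack_in_load_order(together)))
print(pages_for(use_case, pack_in_load_order(interleaved)))
```

Loading A, B and C consecutively places them on one page, so the use case needs one page transfer; interleaving them with unrelated objects spreads them over three pages.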
Given that knowledge, the goals of clustering are:
• Cluster objects together which are accessed together
• Separate (de-cluster, or partition; partitioning is discussed in detail later in this paper) objects which are never accessed together. This includes separating frequently accessed data from rarely accessed data.
Partitioning
Partitioning is a strategy to isolate subsets of objects in different physical storage units. By definition, if two objects are in different partitions, they are de-clustered. The two goals of partitioning are to gain isolation and to increase data density.
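The clustering goals and the idea of partitioning as de-clustering can be sketched by grouping objects according to which use cases touch them together. This is a deliberately simple approach (connected components via union-find), named here plainly as an assumption; it is not a technique taken from ObjectStore, and the `co_access_groups` helper and sample use cases are hypothetical.

```python
from typing import Dict, List, Set

def co_access_groups(use_cases: List[Set[str]]) -> List[Set[str]]:
    """Group objects into connected components of the 'accessed together' relation."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for objects in use_cases:
        members = sorted(objects)
        for other in members[1:]:
            parent[find(other)] = find(members[0])   # union with the first member
        if members:
            find(members[0])                # register single-object use cases

    groups: Dict[str, Set[str]] = {}
    for obj in list(parent):
        groups.setdefault(find(obj), set()).add(obj)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Each set lists the objects one hypothetical use case needs.
for group in co_access_groups([{"A", "B"}, {"B", "C"}, {"D", "E"}, {"F"}]):
    print(sorted(group))
```

Objects in the same group are candidates for co-location, and objects in different groups are never accessed together, so they can be placed in different partitions. Because the grouping is transitive, a realistic design would also weigh access frequency and page size before merging groups.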
Isolation is desirable when concurrent access is required. Concurrency is outside the scope of this paper, so our discussion of partitioning will be rather brief. Although partitioning is intended for isolating objects, its use can also improve data density. Given the definition, this may seem counterintuitive. Let us use an example to illustrate. Imagine a grocery store. If you were in need of a box of cereal, you would go down the cereal aisle. If the grocer has done the job correctly, the aisle (or some number of shelves in the aisle) will be populated ONLY with boxes of cereal.
Because other items have been located in their respective aisles/shelves, the entire cereal aisle is dense with cereal. If the grocer had not done the job correctly, a given section of a shelf might have (for instance) boxes of noodles, cans of vegetables, and bags of chips. In this scenario, the shelf does not have good data density for the goal of obtaining a box of cereal. Recall the definition of data density: the proportion of objects within a given storage block that are accessed by a client during some scope. Our scope is to obtain a box of cereal.
Our storage block is the aisle or a shelf. If the shelf in question contains many items other than cereal, then we have poor data density. If, on the other hand, we partition the non-cereal items into different aisles, then the cereal aisle would contain only cereal and thus a high data density would result.
Conclusion
The way in which data is loaded into the database can have a significant impact on the performance of an application. Careful analysis of the use cases for an application should allow key objects to be identified. Once key objects are identified, a clustering strategy can be planned.
Several of the techniques presented here can allow for a clustering strategy that will boost performance far beyond any tuning that might be done after the database is loaded and the application delivered. It is often the case that several techniques can be combined; an application need not restrict itself to the use of just one technique. The goals of clustering are to reduce the working set size, to yield higher data density, and to reduce the number of pages which need to be transferred between the application and the ObjectStore server.