
1975 ACM Student Award Paper: Second Place

Multidimensional Binary Search Trees Used for Associative Searching

Jon Louis Bentley
Stanford University

This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n record file are: insertion, O(log n); deletion of the root, O(n^((k-1)/k)); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n^((k-t)/k))] and for nearest neighbor queries [empirically observed average running time of O(log n)]. These performances far surpass the best currently known algorithms for these tasks. An algorithm is presented to handle any general intersection query. The main focus of this paper is theoretical. It is felt, however, that k-d trees could be quite useful in many applications, and examples of potential uses are given.

Key Words and Phrases: associative retrieval, binary search trees, key, attribute, information retrieval system, nearest neighbor queries, partial match queries, intersection queries, binary tree insertion

CR Categories: 3.63, 3.70, 3.74, 4.49

Copyright © 1975, Association for Computing Machinery, Inc.
General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. This paper was awarded Second Place in ACM's 1975 George E. Forsythe Student Paper Competition. Author's present address: Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514.

1. Introduction

The problem of associative retrieval (often referred to as retrieval by secondary key) centers around a file F which is a collection of records. Each record R of F is an ordered k-tuple (v0, v1, ..., v(k-1)) of values which are the keys, or attributes, of the record. A retrieval of records from F is initiated at the request of the user, which could be either mechanical or human. A retrieval request is called a query of the file, and specifies certain conditions to be satisfied by the keys of the records it requests to be retrieved from F. An information retrieval system must be capable of initiating an appropriate search upon the arrival of a query to that system. If a query is allowed to specify conditions dealing with a multiplicity of the keys, the searches performed by the system are considered associative. If the user of the system is restricted to specifying conditions for only one of the keys, the resulting search is not considered to be associative (in that case only one of the keys is considered to be "the key" and the remaining attributes are referred to as "data").

Numerous methods exist for building an information retrieval system capable of handling such associative queries. Among these are inverted files, methods of compounding attributes, superimposed coding systems, and combinatorial hashing. Knuth [5] discusses these ideas in detail.
McCreight [6] has proposed that the keys be "shuffled" together bitwise; then unidimensional retrieval algorithms could be used to answer certain queries. Rivest investigated in his thesis [7] the use of binary search tries (see [5] for a detailed discussion of tries) to store records when they are composed of binary keys. Finkel and Bentley [3] discuss a data structure, quad trees, that stores the records of the file in a tree whose nodes have out-degree of 2^k; theirs was the first general approach to use a tree structure. None of the above approaches seems to provide a "perfect" environment for associative retrieval. Each of them falls short in some very important way, either in having only a small class of queries easily performed, large running time, large space requirements, horrible worst cases, or some other adverse properties.

This paper presents a new type of data structure for associative searching, called the multidimensional binary search tree or k-d tree, which is defined in Section 2. In Section 3 an efficient insertion algorithm is given and analyzed. Many types of associative searches are discussed in Section 4, and the k-d tree is shown to perform quite well in all of them. Its worst case performance in partial match searches is analyzed and is shown to equal the best previously attained average performance. A search algorithm is presented that can answer any intersection query. An algorithm given seems to solve the nearest neighbor problem in logarithmic time; its running time is much less than that of any other known method. In Section 5 deletion is shown to be possible in k-d trees, and it is analyzed. Section 6 discusses an optimization algorithm that efficiently transforms any collection of records into a k-d tree with optimal properties. With this background, a discussion of potential applications of k-d trees is presented in Section 7.

Communications of the ACM, September 1975, Volume 18, Number 9
Areas for further work are discussed in Section 8, and conclusions are drawn in Section 9.

2. Definitions and Notations

If a file is represented as a k-d tree, then each record in the file is stored as a node in the tree. In addition to the k keys which comprise the record, each node contains two pointers, which are either null or point to another node in the k-d tree. (Note that each pointer can be considered as specifying a subtree.) Associated with each node, though not necessarily stored as a field, is a discriminator, which is an integer between 0 and k - 1, inclusive. Let us define a notation for these items: the k keys of node P will be called K0(P), ..., K(k-1)(P), the pointers will be LOSON(P) and HISON(P), and the discriminator will be DISC(P).

The defining order imposed by a k-d tree is this: For any node P in a k-d tree, let j be DISC(P); then for any node Q in LOSON(P), it is true that Kj(Q) < Kj(P); likewise, for any node R in HISON(P), it is true that Kj(R) > Kj(P). (This statement does not take into account the possibility of equality of keys, which will be discussed shortly.) All nodes on any given level of the tree have the same discriminator. The root node has discriminator 0, its two sons have discriminator 1, and so on to the kth level, on which the discriminator is k - 1; the (k + 1)-th level has discriminator 0, and the cycle repeats. In general, NEXTDISC is a function defined as NEXTDISC(i) = (i + 1) mod k, and NEXTDISC(DISC(P)) = DISC(LOSON(P)), and likewise for HISON(P) (if they are non-null). Figure 1 gives an example of records in 2-space stored as nodes in a 2-d tree.

The problem of equality of keys mentioned above arises in the definition of a function telling which son of P's to visit next while searching for node Q. This function is written SUCCESSOR(P, Q) and returns either LOSON or HISON. Let j be DISC(P); if Kj(P) ≠ Kj(Q), then it is clear by the defining property of k-d trees which son SUCCESSOR should return.
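The node layout and discriminator cycle just described can be sketched in a few lines (a hedged illustration, not the paper's own code: the class name KDNode and the Python field names are our assumptions; the fields mirror K0(P), ..., K(k-1)(P), LOSON(P), HISON(P), and DISC(P)):

```python
class KDNode:
    """One record of the file, stored as a node of a k-d tree."""
    def __init__(self, keys, disc):
        self.keys = tuple(keys)  # K0(P), ..., K(k-1)(P)
        self.loson = None        # LOSON(P): subtree of j-less nodes
        self.hison = None        # HISON(P): subtree of j-greater nodes
        self.disc = disc         # DISC(P), an integer in [0, k-1]

def next_disc(i, k):
    """NEXTDISC(i) = (i + 1) mod k; discriminators cycle level by level."""
    return (i + 1) % k
```

With k = 2, a root holding (50, 50) gets discriminator 0 and its sons discriminator next_disc(0, 2) = 1, as in Figure 1.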
If the two Kj's are equal, the decision must be based on the remaining keys. The choice of decision function is arbitrary, but for most applications the following method works nicely: Define a superkey of P by Sj(P) = Kj(P) K(j+1)(P) ... K(k-1)(P) K0(P) ... K(j-1)(P), the cyclical concatenation of all keys starting with Kj. If Sj(Q) < Sj(P), SUCCESSOR returns LOSON; if Sj(Q) > Sj(P), it returns HISON. If Sj(Q) = Sj(P), then all k keys are equal, and SUCCESSOR returns a special value to indicate that.

Fig. 1. Records in 2-space stored as nodes in a 2-d tree (boxes represent range of subtree), together with a planar graph representation of the same 2-d tree (LOSON's are expressed by left branches, HISON's by right branches, and null sons by boxes). [Figure not reproduced; the legible point labels are A(50,50), B(10,70), D(25,20), E(40,85), F(70,85), and C(80,85), in the space (0,0)-(100,100).]

Let us define node Q to be j-greater than node P if and only if SUCCESSOR(P, Q) is HISON, and to be j-less than node R if and only if SUCCESSOR(R, Q) is LOSON. It is obvious how these definitions can be used to define the j-maximum and j-minimum elements of a collection of nodes. The j-distance between nodes P and Q is |Kj(P) - Kj(Q)|, and the j-nearest node of a set S to node P is the node Q in S with minimum j-distance between P and Q. If a multiplicity of the nodes have minimum j-distance from P, then the j-nearest node is defined as the j-minimum element in the subset of closest nodes which are j-greater than P (or similarly, the j-maximum element of the nodes which are j-less than P).

We will typically use n as the number of records in the file which is stored as a k-d tree, and therefore the number of nodes in the tree. We have already used k as the dimensionality of the records. The keys of all nodes in the subtree of any node, say P, in a k-d tree are known by P's position in the tree to be bounded by certain values.
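The superkey tie-break can be made concrete as follows (a sketch under our own naming; the paper only requires that SUCCESSOR return LOSON, HISON, or a special equal-keys value, represented here by None):

```python
def superkey(keys, j):
    """Sj(P): cyclical concatenation of the key tuple starting with key j."""
    return keys[j:] + keys[:j]

def successor(p_keys, q_keys, j):
    """Which son of a j-discriminating node P to visit when seeking Q."""
    if q_keys[j] < p_keys[j]:
        return "LOSON"
    if q_keys[j] > p_keys[j]:
        return "HISON"
    # The Kj's are equal: fall back to lexicographic superkey comparison.
    sq, sp = superkey(q_keys, j), superkey(p_keys, j)
    if sq < sp:
        return "LOSON"
    if sq > sp:
        return "HISON"
    return None  # all k keys equal
```

For example, seeking (50, 20) at a 0-discriminating node holding (50, 50) compares the superkeys (50, 20) and (50, 50) and descends into LOSON.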
For instance, if P is in the HISON subtree of Q, and DISC(Q) is j, then all nodes in P's subtree are j-greater than Q; so for any R in the subtree of P, Kj(R) ≥ Kj(Q). To employ this knowledge in our algorithms, we will define a bounds array to hold the information. If B is a bounds array associated with node P, then B has 2k entries, B(0), ..., B(2k - 1). If Q is a descendant of P, then it is true for all integers j in [0, k - 1] that B(2j) ≤ Kj(Q) ≤ B(2j + 1). Bounds array B is considered to be initialized if for all integers j in [0, k - 1], B(2j) = -∞ and B(2j + 1) = +∞. For example, the bounds array representing node C in Figure 1 is (50, 100, 0, 100). This indicates that all K0 values in the subtree are bounded by 50 and 100, and all K1 values are bounded by 0 and 100.

It was noted that it is not necessary to store the discriminator as a field in each node, and one can see that it is easy to keep track of which discriminator one is visiting as one descends in a k-d tree. Although it is therefore superfluous to do so, we will store the discriminator in each node to make the algorithms we write more easily understandable.

3. Insertion

In this section we will first describe an algorithm that inserts a node into a k-d tree. We will then analyze k-d trees and show that if the algorithm is used to insert random nodes into an initially empty tree, the resulting tree will have the nice properties of a randomly built one-dimensional binary search tree.

3.1 An Insertion Algorithm

The algorithm used to insert a node into a k-d tree is also used to search for a specific record in a tree. It is passed a node, say P. If P is in the tree, the algorithm returns a pointer to P, and if P is not in the tree it returns Λ and inserts P into the tree. Algorithm INSERT describes one way of performing such an operation.
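The bounds-array bookkeeping might look like this in code (our own helper names; the paper specifies only the 2k-entry layout, the ±∞ initialization, and the narrowing that position in the tree implies):

```python
import math

def init_bounds(k):
    """Initialized bounds array: B(2j) = -inf and B(2j+1) = +inf for all j."""
    return [-math.inf, math.inf] * k

def narrow(bounds, j, key_value, son):
    """Bounds for the given son of a j-discriminating node with Kj = key_value."""
    b = list(bounds)
    if son == "LOSON":
        b[2 * j + 1] = key_value  # every Kj in LOSON is at most key_value
    else:
        b[2 * j] = key_value      # every Kj in HISON is at least key_value
    return b
```

Starting from init_bounds(2) = (-∞, ∞, -∞, ∞) and descending into HISON of the root A(50,50) of Figure 1 narrows the K0 lower bound to 50, consistent with the (50, 100, 0, 100) entry quoted above once the 0-100 extent of the space is folded in.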
Algorithm INSERT (k-d tree search and insertion). This algorithm is passed a node P which is not in the tree (its HISON, LOSON, and DISC fields are not set). If there is a node in the tree with equal keys, the address of that node is returned; otherwise the node is inserted into the tree and Λ is returned.

I1. [Check for null tree.] If ROOT = Λ then set ROOT ← P, HISON(P) ← Λ, LOSON(P) ← Λ, DISC(P) ← 0, and return Λ; otherwise, set Q ← ROOT (Q will move down the tree).
I2. [Compare.] If Ki(P) = Ki(Q) for 0 ≤ i ≤ k - 1 (i.e. the nodes are equal) then return Q. Otherwise, set SON ← SUCCESSOR(Q, P) (SON will be HISON or LOSON). If SON(Q) = Λ, then go to I4.
I3. [Move down.] Set Q ← SON(Q) and go to I2.
I4. [Insert new node in tree.] Set SON(Q) ← P, HISON(P) ← Λ, LOSON(P) ← Λ, DISC(P) ← NEXTDISC(DISC(Q)), and return Λ.

3.2 Analysis of Randomly Built k-d Trees

Consider a given binary tree of n nodes; our goal in this analysis of k-d trees is to show that the probability of constructing that tree by inserting n random nodes into an initially empty k-d tree is the same as the probability of attaining that tree by random insertion into a one-dimensional binary search tree. Once we have shown this to be true, the theorems which have been proved about one-dimensional binary search trees will be applicable to k-d trees.

We must first define what we mean by random nodes. Since only the relative magnitudes of the keys and the order in which the records arrive are relevant for purposes of insertion, we can assume that the records to be inserted will be defined by a k-tuple of permutations of the integers 1, ..., n. Then the first record to be inserted, say P, would be defined by K0(P), the first element in the first permutation, and so on, to K(k-1)(P), the first element in the kth permutation. The nodes will be considered random if all of the (n!)^k k-tuples of permutations are equally likely to occur.
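Algorithm INSERT translates almost mechanically into executable form. The sketch below is our transcription, not the paper's code: None stands for the null pointer Λ, and the superkey comparison of Section 2 is inlined into _successor (note that comparing superkeys lexicographically already agrees with comparing Kj first, since Kj is the leading element of Sj).

```python
class _Node:
    def __init__(self, keys, disc):
        self.keys = tuple(keys)   # K0(P), ..., K(k-1)(P)
        self.loson = None
        self.hison = None
        self.disc = disc

class KDTree:
    def __init__(self, k):
        self.k = k
        self.root = None

    def _successor(self, q, keys):
        """SUCCESSOR(Q, P): 'loson', 'hison', or None if all keys equal."""
        j = q.disc
        sq = q.keys[j:] + q.keys[:j]   # superkey Sj(Q)
        sp = keys[j:] + keys[:j]       # superkey Sj(P)
        if sp == sq:
            return None
        return "hison" if sp > sq else "loson"

    def insert(self, keys):
        """Return the matching node if the keys are already present;
        otherwise insert a new node and return None (the paper's Lambda)."""
        keys = tuple(keys)
        if self.root is None:                  # I1: null tree
            self.root = _Node(keys, 0)
            return None
        q = self.root
        while True:
            if keys == q.keys:                 # I2: equal nodes
                return q
            son = self._successor(q, keys)
            child = getattr(q, son)
            if child is None:                  # I2: null son, go to I4
                break
            q = child                          # I3: move down
        p = _Node(keys, (q.disc + 1) % self.k) # I4: insert; NEXTDISC
        setattr(q, son, p)
        return None
```

Inserting the Figure 1 records in the order A(50,50), B(10,70), C(80,85), D(25,20) reproduces the tree shown there: B becomes LOSON of A (10 < 50 on key 0), C becomes HISON, and D descends to B's LOSON (20 < 70 on key 1).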
Let us give each of the n nodes in the binary tree t a unique identification number which is an integer between 1 and n. Define Si as the number of nodes in the subtree of t whose root is node i. To simplify our discussion of null sons, let us define the identification number of a null node to be n + 1; thus S(n+1) = 0. We will use Li as the number of nodes in the left subtree (or LOSON) of node i, and Ri as the number of nodes in the right subtree (or HISON); note Si = Li + Ri + 1.

It is important to observe the following fact about the ordering in a collection of random nodes that are to be made into a k-d tree. The first node in the collection, say P (which is the first to be inserted in the tree), will become the root. This induces a partition of the remaining nodes into two subcollections: those nodes that will be in P's left (LOSON) subtree, and those that will be in the right (HISON) subtree. If Q falls in the right subtree of P, and R falls in P's left subtree, then their relative ordering (that is, whether or not Q precedes R) in the original collection is unimportant. The same tree would be built if P were the first element in the collection, followed by all the nodes that fell in P's right subtree, followed by all the nodes that fell in P's left subtree, as long as the orderings within the left and right subcollections were maintained. A second important fact we will use is that when a collection of random nodes is split into two subcollections by this process, the resulting subcollections are themselves random collections of nodes. This is due to the original independence of keys in a given record; the partitioning in no way disturbs that independence.

After having made these two observations, it is easy to compute the probability that tree t as described above results when n random nodes are inserted into an initially empty k-d tree. Assume that the root is a j-discriminator and that its identification number is i; then Si = n.
The probability that the first record will partition the collection into two subcollections having respective sizes Li and Ri is the probability that the jth key of the first element is the (Li + 1)-st smallest in the set of all jth keys. Because of the random nature of the nodes (all of the nodes are equally likely to be the first in the collection), this probability is 1/Si. Now we have reduced the problem: we know that the probability of t occurring is 1/Si times the probability of both of the subcollections forming their respective subtrees. By the second observation above, these probabilities are independent of the choice of the root of the tree and of one another. Therefore we can split the nodes into the left and right subcollections (by the first observation) and apply this analysis recursively to the left son and right son, doing so as we visit each node in the tree once and only once. It is clear that the probability of t resulting is

    Pk(t) = ∏ (1/Si),   the product taken over 1 ≤ i ≤ n.
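This closed form is easy to check mechanically on a small tree. The sketch below uses our own shape encoding (a tree is a (left, right) pair or None) to compute ∏ 1/Si; for the balanced three-node tree it yields 1/3, matching the 2 of the 3! = 6 insertion orders of three distinct keys that build that shape in an ordinary binary search tree.

```python
from fractions import Fraction

def tree_probability(t):
    """Pk(t) = product of 1/Si over the nodes of t, in exact arithmetic."""
    def walk(node):
        if node is None:          # null node: contributes size 0, factor 1
            return 0, Fraction(1)
        left, right = node
        ls, lp = walk(left)
        rs, rp = walk(right)
        si = ls + rs + 1          # Si = Li + Ri + 1
        return si, lp * rp * Fraction(1, si)
    return walk(t)[1]

leaf = (None, None)
balanced = (leaf, leaf)           # root with two leaf sons
assert tree_probability(balanced) == Fraction(1, 3)
```

The left-spine chain of three nodes gives sizes 3, 2, 1 and hence probability 1/6, i.e. exactly one insertion order builds it, as expected.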