Groupe d’études et de recherche en analyse des décisions

G-97-13

Concerning the NJ Algorithm and its Unweighted Version, UNJ

In this paper we will present UNJ, an unweighted version of the NJ algorithm (Saitou and Nei 1987; Studier and Keppler 1988). We will demonstrate that UNJ is well suited when the data are of the ( ij) = (dij ij) type, where dij is a tree distance, and when the ij are independent and identically distributed noise variables. Simulations confirm this theory. On a more general level, we will study the three main components of the agglomerative approach, applied to the reconstruction of tree distances. (i) We will demonstrate that the selection criterion for the pair to be agglomerated, used by NJ and UNJ, retains its meaning whatever the variances and covariances of the ij estimates. We will also provide a new proof of the correction of this criterion, based on an interpretation in acentrality terms proposed by Mirkin (1996). (ii) Using the results of Vach (1989), of which we will provide a simple new demonstration, we propose an analytical formulae which enables the correct least-squares estimation of edge lengths in O(n2) time, when n is the number of objects. (iii) We will provide a class of admissible reduction formulae which guarantee the finding of the true tree with additive data. We propose to choose, among these formulae, the minimum variance reduction, so that at each step we use estimates which are as reliable as possible in choosing the pair to be agglomerated. We will present the general solution, and apply it to the particular data model retained here.

, 34 pages