{"title": "Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms", "book": "Advances in Neural Information Processing Systems", "page_first": 1844, "page_last": 1852, "abstract": "Let $p$ be an unknown and arbitrary probability distribution over $[0 ,1)$. We consider the problem of \\emph{density estimation}, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $\\eps$, we give an algorithm that makes $\\tilde{O}(k/\\eps^2)$ draws from $p$, runs in $\\tilde{O}(k/\\eps^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \\log^2(1/\\eps))$ pieces. With high probability the hypothesis $h$ satisfies $\\dtv(p,h) \\leq C \\cdot \\opt_k(p) + \\eps$, where $\\dtv$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\\opt_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are both optimal up to logarithmic factors. The ``approximation factor'' $C$ that is present in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $\\eps$ can achieve $C < 2$ regardless of what kind of hypothesis distribution it uses.", "full_text": "Near\u2013Optimal Density Estimation in Near\u2013Linear\n\nTime Using Variable\u2013Width Histograms\n\nSiu-On Chan\n\nMicrosoft Research\n\nsochan@gmail.com\n\nIlias Diakonikolas\n\nUniversity of Edinburgh\nilias.d@ed.ac.uk\n\nRocco A. 
Servedio\nColumbia University\n\nrocco@cs.columbia.edu\n\nXiaorui Sun\n\nColumbia University\n\nxiaoruisun@cs.columbia.edu\n\nAbstract\n\nLet p be an unknown and arbitrary probability distribution over [0, 1). We con-\nsider the problem of density estimation, in which a learning algorithm is given\ni.i.d. draws from p and must (with high probability) output a hypothesis distri-\nbution that is close to p. The main contribution of this paper is a highly ef\ufb01cient\ndensity estimation algorithm for learning using a variable-width histogram, i.e., a\nhypothesis distribution with a piecewise constant probability density function.\nIn more detail, for any k and \", we give an algorithm that makes \u02dcO(k/\"2) draws\nfrom p, runs in \u02dcO(k/\"2) time, and outputs a hypothesis distribution h that is piece-\nwise constant with O(k log2(1/\")) pieces. With high probability the hypothesis\nh satis\ufb01es dTV(p, h) \uf8ff C \u00b7 optk(p) + \", where dTV denotes the total variation\ndistance (statistical distance), C is a universal constant, and optk(p) is the small-\nest total variation distance between p and any k-piecewise constant distribution.\nThe sample size and running time of our algorithm are optimal up to logarithmic\nfactors. The \u201capproximation factor\u201d C in our result is inherent in the problem,\nas we prove that no algorithm with sample size bounded in terms of k and \" can\nachieve C < 2 regardless of what kind of hypothesis distribution it uses.\n\n1\n\nIntroduction\n\nConsider the following fundamental statistical task: Given independent draws from an unknown\nprobability distribution, what is the minimum sample size needed to obtain an accurate estimate of\nthe distribution? This is the question of density estimation, a classical problem in statistics with a\nrich history and an extensive literature (see e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). 
While this\nbroad question has mostly been studied from an information\u2013theoretic perspective, it is an inherently\nalgorithmic question as well, since the ultimate goal is to describe and understand algorithms that are\nboth computationally and information-theoretically ef\ufb01cient. The need for computationally ef\ufb01cient\nlearning algorithms is only becoming more acute with the recent \ufb02ood of data across the sciences;\nthe \u201cgold standard\u201d in this \u201cbig data\u201d context is an algorithm with information-theoretically (near-)\noptimal sample size and running time (near-) linear in its sample size.\nIn this paper we consider learning scenarios in which an algorithm is given an input data set which\nis a sample of i.i.d. draws from an unknown probability distribution. It is natural to expect (and can\nbe easily formalized) that, if the underlying distribution of the data is inherently \u201ccomplex\u201d, it may\nbe hard to even approximately reconstruct the distribution. But what if the underlying distribution\nis \u201csimple\u201d or \u201csuccinct\u201d \u2013 can we then reconstruct the distribution to high accuracy in a computa-\ntionally and sample-ef\ufb01cient way? In this paper we answer this question in the af\ufb01rmative for the\n\n1\n\n\fproblem of learning \u201cnoisy\u201d histograms, arguably one of the most basic density estimation problems\nin the literature.\nTo motivate our results, we begin by brie\ufb02y recalling the role of histograms in density estimation.\nHistograms constitute \u201cthe oldest and most widely used method for density estimation\u201d [Sil86], \ufb01rst\nintroduced by Karl Pearson in [Pea95]. Given a sample from a probability density function (pdf)\np, the method partitions the domain into a number of intervals (bins) B1, . . . , Bk, and outputs the\n\u201cempirical\u201d pdf which is constant within each bin. A k-histogram is a piecewise constant distribution\nover bins B1, . . . 
, Bk, where the probability mass of each interval Bj, j 2 [k], equals the fraction of\nobservations in the interval. Thus, the goal of the \u201chistogram method\u201d is to approximate an unknown\npdf p by an appropriate k-histogram. It should be emphasized that the number k of bins to be used\nand the \u201cwidth\u201d and location of each bin are unspeci\ufb01ed; they are parameters of the estimation\nproblem and are typically selected in an ad hoc manner.\nWe study the following distribution learning question:\n\nSuppose that there exists a k-histogram that provides an accurate approximation\nto the unknown target distribution. Can we ef\ufb01ciently \ufb01nd such an approximation?\n\nIn this paper, we provide a fairly complete af\ufb01rmative answer to this basic question. Given a bound\nk on the number of intervals, we give an algorithm that uses a near-optimal sample size, runs in\nnear-linear time (in its sample size), and approximates the target distribution nearly as accurately as\nthe best k-histogram.\nTo formally state our main result, we will need a few de\ufb01nitions. We work in a standard model of\nlearning an unknown probability distribution from samples, essentially that of [KMR+94], which\nis a natural analogue of Valiant\u2019s well-known PAC model for learning Boolean functions [Val84] to\nthe unsupervised setting of learning an unknown probability distribution.1 A distribution learning\nproblem is de\ufb01ned by a class C of distributions over a domain \u2326. The algorithm has access to\nindependent draws from an unknown pdf p, and its goal is to output a hypothesis distribution h\nthat is \u201cclose\u201d to the target distribution p. We measure the closeness between distributions using\nthe statistical distance or total variation distance. 
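Returning briefly to the histogram method described above, it is simple to state in code. The following is a minimal illustrative sketch (our own, not from the paper), in which the bin edges are assumed to be supplied; choosing the number and placement of bins is exactly the part the histogram method leaves unspecified, and is what the algorithm of this paper automates.

```python
import bisect

def empirical_histogram(samples, bin_edges):
    """Empirical k-histogram: a piecewise constant pdf on [0, 1) whose mass on
    each bin [bin_edges[j], bin_edges[j+1]) equals the fraction of samples
    that land in that bin."""
    k = len(bin_edges) - 1
    counts = [0] * k
    for x in samples:
        j = min(bisect.bisect_right(bin_edges, x) - 1, k - 1)
        counts[j] += 1
    n = len(samples)

    def pdf(x):
        j = min(bisect.bisect_right(bin_edges, x) - 1, k - 1)
        width = bin_edges[j + 1] - bin_edges[j]
        # density on a bin = (fraction of samples in bin) / (bin width)
        return counts[j] / (n * width)

    return pdf
```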
In the \u201cnoiseless\u201d setting, we are promised that\np 2 C and the goal is to construct a hypothesis h such that (with high probability) the total variation\ndistance dTV(h, p) between h and p is at most \", where \" > 0 is the accuracy parameter.\nThe more challenging \u201cnoisy\u201d or agnostic model captures the situation of having arbitrary (or even\nadversarial) noise in the data. In this setting, we do not make any assumptions about the target den-\nsity p and the goal is to \ufb01nd a hypothesis h that is almost as accurate as the \u201cbest\u201d approximation of p\nby any distribution in C. Formally, given sample access to a (potentially arbitrary) target distribution\np and \" > 0, the goal of an agnostic learning algorithm for C is to compute a hypothesis distribution\nh such that dTV(h, p) \uf8ff \u21b5 \u00b7 optC(p) + \", where optC(p) := inf q2C dTV(q, p) \u2013 i.e., optC(p) is\nthe statistical distance between p and the closest distribution to it in C \u2013 and \u21b5 1 is a constant\n(that may depend on the class C). We will call such a learning algorithm an \u21b5-agnostic learning\nalgorithm for C; when \u21b5 > 1 we sometimes refer to this as a semi-agnostic learning algorithm.\nA distribution f over a \ufb01nite interval I \u2713 R is called k-\ufb02at if there exists a partition of I into k\nintervals I1, . . . , Ik such that the pdf f is constant within each such interval. We henceforth (without\nloss of generality for densities with bounded support) restrict ourselves to the case I = [0, 1). Let\nCk be the class of all k-\ufb02at distributions over [0, 1). For a (potentially arbitrary) distribution p over\n[0, 1) we will denote by optk(p) := inf f2Ck dTV(f, p).\nIn this terminology, our learning problem is exactly the problem of agnostically learning the class\nof k-\ufb02at distributions. 
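Since both the hypotheses and the comparison class consist of piecewise constant densities, the total variation distance between two such densities can be computed exactly by merging their breakpoints. A small sketch of this computation (our own illustration; each density is represented by its edges and per-piece values, with edges starting at 0 and ending at 1):

```python
def l1_distance(f_edges, f_vals, g_edges, g_vals):
    """Exact L1 distance between two piecewise constant pdfs on [0, 1);
    f_vals[i] is the constant density on [f_edges[i], f_edges[i+1]).
    The total variation distance dTV is half the returned value."""
    pts = sorted(set(f_edges) | set(g_edges))
    total, fi, gi = 0.0, 0, 0
    for a, b in zip(pts, pts[1:]):
        while f_edges[fi + 1] <= a:  # advance to the f-piece containing [a, b)
            fi += 1
        while g_edges[gi + 1] <= a:  # advance to the g-piece containing [a, b)
            gi += 1
        total += (b - a) * abs(f_vals[fi] - g_vals[gi])
    return total
```

For example, the uniform density against the 2-flat density taking values 1.5 and 0.5 on the two halves of [0, 1) has L1 distance 0.5, i.e., total variation distance 0.25.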
Our main positive result is a near-optimal algorithm for this problem, i.e.,\na semi-agnostic learning algorithm that has near-optimal sample size and near-linear running time.\nMore precisely, we prove the following:\nTheorem 1 (Main). There is an algorithm A with the following property: Given k 1, \" > 0,\nand sample access to a target distribution p, algorithm A uses \u02dcO(k/\"2) independent draws from\np, runs in time \u02dcO(k/\"2), and outputs a O(k log2(1/\"))-\ufb02at hypothesis distribution h that satis\ufb01es\ndTV(h, p) \uf8ff O(optk(p)) + \" with probability at least 9/10.\n\n1We remark that our model is essentially equivalent to the \u201cminimax rate of convergence under the L1\n\ndistance\u201d in statistics [DL01], and our results carry over to this setting as well.\n\n2\n\n\fUsing standard techniques, the con\ufb01dence probability can be boosted to 1 , for any > 0, with\na (necessary) overhead of O(log(1/)) in the sample size and the running time.\nWe emphasize that the dif\ufb01culty of our result lies in the fact that the \u201coptimal\u201d piecewise constant\ndecomposition of the domain is both unknown and approximate (in the sense that optk(p) > 0);\nand that our algorithm is both sample-optimal and runs in (near-) linear time. Even in the (signi\ufb01-\ncantly easier) case that the target p 2 Ck (i.e., optk(p) = 0), and the optimal partition is explicitly\ngiven to the algorithm, it is known that a sample of size \u2326(k/\"2) is information-theoretically nec-\nessary. 
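The Θ(√(k/m)) behavior underlying the Ω(k/ε²) sample bound is easy to observe empirically for the k-element discrete case: the L1 error of the empirical distribution decays like √(k/m). A quick simulation sketch (our own, with an arbitrary fixed seed):

```python
import random

def empirical_l1_error(k, m, trials=50, seed=0):
    """Average L1 distance between the uniform distribution on k symbols
    and the empirical distribution of m i.i.d. draws from it; theory
    predicts this behaves like Theta(sqrt(k/m))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = [0] * k
        for _ in range(m):
            counts[rng.randrange(k)] += 1
        # L1 distance between empirical frequencies and the uniform pmf
        total += sum(abs(c / m - 1 / k) for c in counts)
    return total / trials
```

Quadrupling m should roughly halve the error, so driving the error below ε requires m = Ω(k/ε²) draws.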
(This lower bound can, e.g., be deduced from the standard fact that learning an unknown\ndiscrete distribution over a k-element set to statistical distance \" requires an \u2326(k/\"2) size sample.)\nHence, our algorithm has provably optimal sample complexity (up to a logarithmic factor), runs in\nessentially sample linear time, and is \u21b5-agnostic for a universal constant \u21b5 > 1.\nIt should be noted that the sample size required for our problem is well-understood; it follows from\nthe VC theorem (Theorem 3) that O(k/\"2) draws from p are information-theoretically suf\ufb01cient.\nHowever, the theorem is non-constructive, and the \u201cobvious\u201d algorithm following from it has run-\nning time exponential in k and 1/\". In recent work, Chan et al [CDSS14] presented an approach\nemploying an intricate combination of dynamic programming and linear programming which yields\na poly(k/\") time algorithm for the above problem. However, the running time of the [CDSS14] al-\ngorithm is \u2326(k3) even for constant values of \", making it impractical for applications. As discussed\nbelow our algorithmic approach is signi\ufb01cantly different from that of\n[CDSS14], using neither\ndynamic nor linear programming.\nApplications. Nonparametric density estimation for shape restricted classes has been a subject\nof study in statistics since the 1950\u2019s (see [BBBB72] for an early book on the topic and [Gre56,\nBru58, Rao69, Weg70, HP76, Gro85, Bir87] for some of the early literature), and has applications\nto a range of areas including reliability theory (see [Reb05] and references therein). 
By using the\nstructural approximation results of Chan et al [CDSS13], as an immediate corollary of Theorem 1\nwe obtain sample optimal and near-linear time estimators for various well-studied classes of shape\nrestricted densities including monotone, unimodal, and multimodal densities (with unknown mode\nlocations), monotone hazard rate (MHR) distributions, and others (because of space constraints we\ndo not enumerate the exact descriptions of these classes or statements of these results here, but\ninstead refer the interested reader to [CDSS13]). Birg\u00b4e [Bir87] obtained a sample optimal and linear\ntime estimator for monotone densities, but prior to our work, no linear time and sample optimal\nestimator was known for any of the other classes.\nOur algorithm from Theorem 1 is \u21b5-agnostic for a constant \u21b5 > 1. It is natural to ask whether a\nsigni\ufb01cantly stronger accuracy guarantee is ef\ufb01ciently achievable; in particular, is there an agnostic\nalgorithm with similar running time and sample complexity and \u21b5 = 1? Perhaps surprisingly, we\nprovide a negative answer to this question. Even in the simplest nontrivial case that k = 2, and the\ntarget distribution is de\ufb01ned over a discrete domain [N ] = {1, . . . , N}, any \u21b5-agnostic algorithm\nwith \u21b5 < 2 requires large sample size:\n\nTheorem 2 (Lower bound, Informal statement). Any 1.99-agnostic learning algorithm for 2-\ufb02at\ndistributions over [N ] requires a sample of size \u2326(pN ).\n\nSee Theorem 7 in Section 4 for a precise statement. Note that there is an exact correspondence be-\ntween distributions over the discrete domain [N ] and pdf\u2019s over [0, 1) which are piecewise constant\non each interval of the form [k/N, (k + 1)/N ) for k 2 {0, 1, . . . , N 1}. Thus, Theorem 2 implies\nthat no \ufb01nite sample algorithm can 1.99-agnostically learn even 2-\ufb02at distributions over [0, 1). (See\nCorollary 4.1 in Section 4 for a detailed statement.)\nRelated work. 
A number of techniques for density estimation have been developed in the mathemat-\nical statistics literature, including kernels and variants thereof, nearest neighbor estimators, orthog-\nonal series estimators, maximum likelihood estimators (MLE), and others (see Chapter 2 of [Sil86]\nfor a survey of existing methods). The main focus of these methods has been on the statistical rate\nof convergence, as opposed to the running time of the corresponding estimators. We remark that\nthe MLE does not exist for very simple classes of distributions (e.g., unimodal distributions with\nan unknown mode, see e.g, [Bir97]). We note that the notion of agnostic learning is related to the\nliterature on model selection and oracle inequalities [MP007], however this work is of a different\n\ufb02avor and is not technically related to our results.\n\n3\n\n\fHistograms have also been studied extensively in various areas of computer science, including\ndatabases and streaming [JKM+98, GKS06, CMN98, GGI+02] under various assumptions about\nthe input data and the precise objective. Recently, Indyk et al [ILR12] studied the problem of learn-\ning a k-\ufb02at distribution over [N ] under the L2 norm and gave an ef\ufb01cient algorithm with sample\ncomplexity O(k2 log(N )/\"4). Since the L1 distance is a stronger metric, Theorem 1 implies an\nimproved sample and time bound of \u02dcO(k/\"2) for their setting.\n\n2 Preliminaries\n\nThroughout the paper we assume that the underlying distributions have Lebesgue measurable den-\nsities. For a pdf p : [0, 1) ! R+ and a Lebesgue measurable subset A \u2713 [0, 1), i.e., A 2 L([0, 1)),\nwe use p(A) to denoteRz2A p(z). The statistical distance or total variation distance between two\ndensities p, q : [0, 1) ! R+ is dTV(p, q) := supA2L([0,1)) |p(A) q(A)|. 
The statistical distance satisfies the identity dTV(p, q) = (1/2) · ||p − q||₁, where ||p − q||₁ := ∫_{[0,1)} |p(x) − q(x)| dx is the L1 distance between p and q; for convenience in the rest of the paper we work with the L1 distance. We refer to a nonnegative function p over an interval (which need not necessarily integrate to one over the interval) as a "sub-distribution." Given a value κ > 0, we say that a (sub-)distribution p over [0, 1) is κ-well-behaved if sup_{z∈[0,1)} Pr_{x∼p}[x = z] ≤ κ, i.e., no individual real value is assigned more than κ probability under p. Any probability distribution with no atoms is κ-well-behaved for all κ > 0. Our results apply to general distributions over [0, 1), which may have an atomic part as well as a non-atomic part. Given m independent draws s1, . . . , sm from a distribution p over [0, 1), the empirical distribution p̂m over [0, 1) is the discrete distribution supported on {s1, . . . , sm} defined as follows: for all z ∈ [0, 1), Pr_{x∼p̂m}[x = z] = |{j ∈ [m] : sj = z}|/m.

The VC inequality. Let p : [0, 1) → R be a Lebesgue measurable function. Given a family of subsets A ⊆ L([0, 1)), define ||p||_A := sup_{A∈A} |p(A)|. The VC dimension of A is the maximum size of a subset X ⊆ [0, 1) that is shattered by A (a set X is shattered by A if for every Y ⊆ X, some A ∈ A satisfies A ∩ X = Y). If there is a shattered subset of size s for every s ∈ Z+, we say that the VC dimension of A is infinite. The well-known Vapnik–Chervonenkis (VC) inequality states the following:

Theorem 3 (VC inequality, [DL01, p. 31]). Let p : I → R+ be a probability density function over I ⊆ R and let p̂m be the empirical distribution obtained after drawing m points from p. Let A ⊆ 2^I be a family of subsets with VC dimension d. Then E[||p − p̂m||_A] ≤ O(√(d/m)).

Partitioning into intervals of approximately equal mass. 
As a basic primitive, given access to\na sample drawn from a \uf8ff-well-behaved target distribution p over [0, 1), we will need to partition\n[0, 1) into \u21e5(1/\uf8ff) intervals each of which has probability \u21e5(\uf8ff) under p. There is a simple algo-\nrithm, based on order statistics, which does this and has the following performance guarantee (see\nAppendix A.2 of [CDSS14]):\nLemma 2.1. Given \uf8ff 2 (0, 1) and access to points drawn from a \uf8ff/64-well-behaved distribution\np over [0, 1), the procedure Approximately-Equal-Partition draws O((1/\uf8ff) log(1/\uf8ff))\npoints from p, runs in time \u02dcO(1/\uf8ff), and with probability at least 99/100 outputs a partition of [0, 1)\ninto ` = \u21e5(1/\uf8ff) intervals such that p(Ij) 2 [\uf8ff/2, 3\uf8ff] for all 1 \uf8ff j \uf8ff `.\n\n3 The algorithm and its analysis\n\nIn this section we prove our main algorithmic result, Theorem 1. Our approach has the following\nhigh-level structure: In Section 3.1 we give an algorithm for agnostically learning a target distri-\nbution p that is \u201cnice\u201d in two senses: (i) p is well-behaved (i.e., it does not have any heavy atomic\nelements), and (ii) optk(p) is bounded from above by the error parameter \". In Section 3.2 we give a\ngeneral ef\ufb01cient reduction showing how the second assumption can be removed, and in Section 3.3\nwe brie\ufb02y explain how the \ufb01rst assumption can be removed, thus yielding Theorem 1.\n\n4\n\n\f3.1 The main algorithm\n\nIn this section we give our main algorithmic result, which handles well-behaved distributions p for\nwhich optk(p) is not too large:\nTheorem 4. There is an algorithm Learn-WB-small-opt-k-histogram that given as input\n\u02dcO(k/\"2) i.i.d. 
draws from a target distribution p and a parameter ε > 0, runs in time Õ(k/ε²), and has the following performance guarantee: If (i) p is (ε/log(1/ε))/(384k)-well-behaved, and (ii) optk(p) ≤ ε, then with probability at least 19/20, it outputs an O(k · log²(1/ε))-flat distribution h such that dTV(p, h) ≤ 2 · optk(p) + 3ε.

We require some notation and terminology. Let r be a distribution over [0, 1), and let P be a set of disjoint intervals that are contained in [0, 1). We say that the P-flattening of r, denoted (r)P, is the sub-distribution defined as

(r)P(v) = r(I)/|I| if v ∈ I for some I ∈ P, and (r)P(v) = 0 if v does not belong to any I ∈ P.

Observe that if P is a partition of [0, 1), then (since r is a distribution) (r)P is a distribution.
We say that two intervals I, I′ are consecutive if I = [a, b) and I′ = [b, c). Given two consecutive intervals I, I′ contained in [0, 1) and a sub-distribution r, we use αr(I, I′) to denote the L1 distance between (r){I,I′} and (r){I∪I′}, i.e., αr(I, I′) = ∫_{I∪I′} |(r){I,I′}(x) − (r){I∪I′}(x)| dx. Note here that {I ∪ I′} is a set that contains one element, the interval [a, c).

3.1.1 Intuition for the algorithm

We begin with a high-level intuitive explanation of the Learn-WB-small-opt-k-histogram algorithm. It starts in Step 1 by constructing a partition of [0, 1) into z = Θ(k/ε′) intervals I1, . . . , Iz (where ε′ = Θ̃(ε)) such that p has weight Θ(ε′/k) on each subinterval. In Step 2 the algorithm draws a sample of Õ(k/ε²) points from p and uses them to define an empirical distribution p̂m. This is the only step in which points are drawn from p. 
For the rest of this intuitive\nexplanation we pretend that the weightbp(I) that the empirical distributionbpm assigns to each inter-\n\nval I is actually the same as the true weight p(I) (Lemma 3.1 below shows that this is not too far\nfrom the truth).\nBefore continuing with our explanation of the algorithm, let us digress brie\ufb02y by imagining for a\nmoment that the target distribution p actually is a k-\ufb02at distribution (i.e., that optk(p) = 0). In this\n\ngiven these it is not dif\ufb01cult to construct a high-accuracy hypothesis).\n\ncase there are at most k \u201cbreakpoints\u201d, and hence at most k intervals Ij for which \u21b5bpm(Ij, Ij+1) > 0,\nso computing the \u21b5bpm(Ij, Ij+1) values would be an easy way to identify the true breakpoints (and\nIn reality, we may of course have optk(p) > 0; this means that if we try to use the \u21b5bpm(Ij, Ij+1)\ncriterion to identify \u201cbreakpoints\u201d of the optimal k-\ufb02at distribution that is closest to p (call this k-\ufb02at\ndistribution q), we may sometimes be \u201cfooled\u201d into thinking that q has a breakpoint in an interval\nIj where it does not (but rather the value \u21b5bpm(Ij, Ij+1) is large because of the difference between\nq and p). However, recall that by assumption we have optk(p) \uf8ff \"; this bound can be used to\nshow that there cannot be too many intervals Ij for which a large value of \u21b5bpm(Ij, Ij+1) suggests\na \u201cspurious breakpoint\u201d (see the proof of Lemma 3.3). This is helpful, but in and of itself not\nenough; since our partition I1, . . . 
, Iz divides [0, 1) into k/\"0 intervals, a naive approach based on\nthis would result in a (k/\"0)-\ufb02at hypothesis distribution, which in turn would necessitate a sample\ncomplexity of \u02dcO(k/\"03), which is unacceptably high.\nInstead, our algorithm performs a careful\n\nprocess of iteratively merging consecutive intervals for which the \u21b5bpm(Ij, Ij+1) criterion indicates\nthat a merge will not adversely affect the \ufb01nal accuracy by too much. As a result of this process\nwe end up with k \u00b7 polylog(1/\") intervals for the \ufb01nal hypothesis, which enables us to output a\n(k \u00b7 polylog(1/\"0))-\ufb02at \ufb01nal hypothesis using \u02dcO(k/\"02) draws from p.\nIn more detail, this iterative merging is carried out by the main loop of the algorithm in Step 4.\nGoing into the t-th iteration of the loop, the algorithm has a partition Pt1 of [0, 1) into disjoint\nsub-intervals, and a set Ft1 \u2713 Pt1 (i.e., every interval belonging to Ft1 also belongs to Pt1).\nInitially P0 contains all the intervals I1, . . . , Iz and F0 is empty. Intuitively, the intervals in Pt1 \\\n\n5\n\n\fFt1 are still being \u201cprocessed\u201d; such an interval may possibly be merged with a consecutive interval\nfrom Pt1 \\ Ft1 if doing so would only incur a small \u201ccost\u201d (see condition (iii) of Step 4(b) of the\nalgorithm).The intervals in Ft1 have been \u201cfrozen\u201d and will not be altered or used subsequently in\nthe algorithm.\n\n3.1.2 The algorithm\n\nAlgorithm Learn-WB-small-opt-k-histogram:\nInput: parameters k 1, \" > 0; access to i.i.d. draws from target distribution p over [0, 1)\nOutput: If (i) p is \"/ log(1/\")\n99/100 the output is a distribution q such that dTV(p, q) \uf8ff 2optk(p) + 3\".\n\n-well-behaved and (ii) optk(p) \uf8ff \", then with probability at least\n\n384k\n\n1. Let \"0 = \"/ log(1/\"). 
Run Algorithm Approximately-Equal-Partition on\ninput parameter \"0\n6k to partition [0, 1) into z = \u21e5(k/\"0) intervals I1 = [i0, i1), . . . ,\nIz = [iz1, iz), where i0 = 0 and iz = 1,\nsuch that with probability at least\n99/100, for each j 2 {1, . . . , z} we have p([ij1, ij)) 2 [\"0/12k, \"0/2k] (assuming p\nis \"0/(384k)-well-behaved).\n\n2. Draw m = \u02dcO(k/\"02) points from p and letbpm be the resulting empirical distribution.\n\n3. Set P0 = {I1, I2, . . . Iz}, and F0 = ;.\n4. Let s = log2\n\n\"0 . Repeat for t = 1, . . . until t = s:\n\n1\n\n(c) Initialize i to 1, and repeatedly execute one of the following four (mutually ex-\n\n(a) Initialize Pt to ; and Ft to Ft1.\n(b) Without loss of generality, assume Pt1 = {It1,1, . . . , It1,zt1} where inter-\nval It1,i is to the left of It1,i+1 for all i. Scan left to right across the intervals\nin Pt1 (i.e., iterate over i = 1, . . . , zt1 1). If intervals It1,i, It1,i+1 are (i)\nboth not in Ft1, and (ii) \u21b5bpm(It1,i, It1,i+1) > \"0/(2k), then add both It1,i\nand It1,i+1 into Ft.\nclusive and exhaustive) cases until i > zt1:\n[Case 1] i \uf8ff zt1 1 and It1,i = [a, b), It1,i+1 = [b, c) are consecutive\nintervals both not in Ft. Add the merged interval It1,i [ It1,i+1 = [a, c) into\nPt. Set i i + 2.\n[Case 2] i \uf8ff zt1 1 and It1,i 2 Ft. Set i i + 1.\n[Case 3] i \uf8ff zt1 1, It1,i /2 Ft and It1,i+1 2 Ft. Add It1,i into Ft and\nset i i + 2.\n[Case 4] i = zt1. Add It1,zt1 into Ft if It1,zt1 is not in Ft and set i \ni + 1.\n\n(d) Set Pt Pt [ Ft.\n\n5. Output the |Ps|-\ufb02at hypothesis distribution (bpm)Ps.\n\n3.1.3 Analysis of the algorithm and proof of Theorem 4\nIt is straightforward to verify the claimed running time given Lemma 2.1, which bounds the running\ntime of Approximately-Equal-Partition.\nIndeed, we note that Step 2, which simply\ndraws \u02dcO(k/\"02) points and constructs the resulting empirical distribution, dominates the overall\nrunning time. 
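Steps 3–4 of the algorithm above can be sketched compactly. The following is a simplified illustration (our own, not the authors' code): each interval carries its endpoints, its empirical mass under the sample, and a frozen flag, and alpha evaluates the flattening discrepancy in closed form for constant densities.

```python
import math

def alpha(i1, i2):
    """alpha between two consecutive intervals: the L1 distance between
    flattening them separately and flattening their union.  Each interval
    is [left, right, mass, frozen]; for constant densities this reduces to
    2 * |mass1 - len1 * (mass1 + mass2) / (len1 + len2)|."""
    l1, l2 = i1[1] - i1[0], i2[1] - i2[0]
    return 2.0 * abs(i1[2] - l1 * (i1[2] + i2[2]) / (l1 + l2))

def merge_rounds(part, k, eps_p):
    """part: list of [left, right, empirical_mass, frozen].  Runs the
    s = ceil(log2(1/eps')) freeze-and-merge rounds of Step 4 and returns
    the final partition."""
    for _ in range(max(1, math.ceil(math.log2(1.0 / eps_p)))):
        # Step 4(b): freeze adjacent unfrozen pairs whose alpha is too large.
        for i in range(len(part) - 1):
            a, b = part[i], part[i + 1]
            if not a[3] and not b[3] and alpha(a, b) > eps_p / (2 * k):
                a[3] = b[3] = True
        # Step 4(c): merge remaining adjacent unfrozen pairs (Case 1);
        # an unfrozen interval blocked by a frozen neighbor or at the end
        # is frozen itself (Cases 3-4); frozen intervals pass through (Case 2).
        out, i = [], 0
        while i < len(part):
            a = part[i]
            if i + 1 < len(part) and not a[3] and not part[i + 1][3]:
                b = part[i + 1]
                out.append([a[0], b[1], a[2] + b[2], False])
                i += 2
            else:
                if not a[3]:
                    a[3] = True
                out.append(a)
                i += 1
        part = out
    return part
```

On an approximately uniform empirical distribution every alpha is small, so the z = Θ(k/ε′) starting intervals collapse geometrically across the rounds, which is how the final partition ends up with only k · polylog(1/ε) pieces.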
In the rest of this subsubsection we prove correctness. We first observe that with high probability the empirical distribution p̂m defined in Step 2 gives a high-accuracy estimate of the true probability of any union of consecutive intervals from I1, . . . , Iz. The following lemma from [CDSS14] follows from the standard multiplicative Chernoff bound:

Lemma 3.1 (Lemma 12, [CDSS14]). With probability 99/100 over the sample drawn in Step 2, for every 0 ≤ a < b ≤ z we have that |p̂m([ia, ib)) − p([ia, ib))| ≤ √(ε′(b − a)) · ε′/(10k).

We henceforth assume that this 99/100-likely event indeed takes place, so the above inequality holds for all 0 ≤ a < b ≤ z. We use this to show that the αp̂m(It−1,i, It−1,i+1) value that the algorithm uses in Step 4(b) is a good proxy for the actual value αp(It−1,i, It−1,i+1) (which of course is not accessible to the algorithm):

Lemma 3.2. Fix 1 ≤ t ≤ s. Then we have |αp̂m(It−1,i, It−1,i+1) − αp(It−1,i, It−1,i+1)| ≤ 2ε′/(5k).

Due to space constraints the proofs of all lemmas in this section are deferred to Appendix A.
For the rest of the analysis, let q denote a fixed k-flat distribution that is closest to p, so ||p − q||₁ = optk(p). (We note that while optk(p) is defined as inf_{q∈Ck} ||p − q||₁, standard closure arguments can be used to show that the infimum is actually achieved by some k-flat distribution q.) Let Q be the partition of [0, 1) corresponding to the intervals on which q is piecewise constant. We say that a breakpoint of Q is a value in [0, 1] that is an endpoint of one of the (at most) k intervals in Q.
The following important lemma bounds the number of intervals in the final partition Ps:

Lemma 3.3. Ps contains at most O(k log²(1/ε)) intervals.

The following definition will be useful:

Definition 5. Let P denote any partition of [0, 1). 
We say that partition P is ε′-good for (p, q) if for every breakpoint v of Q, the interval I in P containing v satisfies p(I) ≤ ε′/(2k).
The above definition is justified by the following lemma:

Lemma 3.4. If P is ε′-good for (p, q), then ||p − (p)P||₁ ≤ 2optk(p) + ε′.

We are now in a position to prove the following:

Lemma 3.5. There exists a partition R of [0, 1) that is ε′-good for (p, q) and satisfies ||(p)Ps − (p)R||₁ ≤ ε.

We construct the claimed R based on Ps, Ps−1, . . . , P0 as follows: (i) If I is an interval in Ps not containing a breakpoint of Q, then I is also in R; (ii) If I is an interval in Ps that does contain a breakpoint of Q, then we further partition I into a set of intervals S in a recursive manner using Ps−1, . . . , P0 (see Appendix A.4). Finally, by putting everything together we can prove Theorem 4:

Proof of Theorem 4. By Lemma 3.4 applied to R, we have that ||p − (p)R||₁ ≤ 2optk(p) + ε′. By Lemma 3.5, we have that ||(p)Ps − (p)R||₁ ≤ ε; thus the triangle inequality gives that ||p − (p)Ps||₁ ≤ 2optk(p) + 2ε. By Lemma 3.3 the partition Ps contains at most O(k log²(1/ε)) intervals, so both (p)Ps and (p̂m)Ps are O(k log²(1/ε))-flat distributions. Thus, ||(p)Ps − (p̂m)Ps||₁ = ||(p)Ps − (p̂m)Ps||_{Aℓ}, where ℓ = O(k log²(1/ε)) and Aℓ is the family of all subsets of [0, 1) that consist of unions of up to ℓ intervals (which has VC dimension 2ℓ). Consequently, by the VC inequality (Theorem 3), for a suitable choice of m = Õ(k/ε′²) we have that E[||(p)Ps − (p̂m)Ps||₁] ≤ 4ε′/100. Markov's inequality now gives that with probability at least 96/100, we have ||(p)Ps − (p̂m)Ps||₁ ≤ ε′. Hence, with overall probability at least 19/20 (recall the 1/100 error probability incurred in Lemma 3.1), we have that ||p − (p̂m)Ps||₁ ≤ 2optk(p) + 3ε, and the theorem is proved.

3.2 A general reduction to the case of small opt for semi-agnostic learning

In this section we show that under mild conditions, the general problem of agnostic distribution learning for a class C can be efficiently reduced to the special case when optC is not too large compared with ε. While the reduction is simple and generic, we have not previously encountered it in the literature on density estimation, so we provide a proof in Appendix A.5. A precise statement of the reduction follows:

Theorem 6. Let A be an algorithm with the following behavior: A is given as input i.i.d. points drawn from p and a parameter ε > 0. A uses m(ε) = Ω(1/ε) draws from p, runs in time t(ε) = Ω(1/ε), and satisfies the following: if optC(p) ≤ 10ε, then with probability at least 19/20 it outputs a hypothesis distribution q such that (i) ||p − q||₁ ≤ α · optC(p) + ε, where α is an absolute constant, and (ii) given any r ∈ [0, 1), the value q(r) of the pdf of q at r can be efficiently computed in T time steps.
Then there is an algorithm A′ with the following performance guarantee: A′ is given as input i.i.d. draws from p and a parameter ε > 0.² Algorithm A′ uses O(m(ε/10) + log log(1/ε)/ε²) draws from p, runs in time O(t(ε/10)) + T · Õ(1/ε²), and outputs a hypothesis distribution q′ such that with probability at least 39/40 we have ||p − q′||₁ ≤ 10(α + 2) · optC(p) + ε.

3.3 Dealing with distributions that are not well behaved

The assumption that the target distribution p is Θ̃(ε/k)-well-behaved can be straightforwardly removed by following the approach in Section 3.6 of [CDSS14]. 
That paper presents a simple linear-time sampling-based procedure, using Õ(k/ε) samples, that with high probability identifies all the "heavy" elements (atoms which cause p to not be well-behaved, if any such points exist).

Our overall algorithm first runs this procedure to find the set S of "heavy" elements, and then runs the algorithm presented above (which succeeds for well-behaved distributions, i.e., distributions that have no "heavy" elements) using as its target distribution the conditional distribution of p over [0, 1) \ S (let us denote this conditional distribution by p′). A straightforward analysis given in [CDSS14] shows that (i) opt_k(p) ≥ opt_k(p′), and moreover (ii) d_TV(p, p′) ≤ opt_k(p). Thus, by the triangle inequality, any hypothesis h satisfying d_TV(h, p′) ≤ C · opt_k(p′) + ε will also satisfy d_TV(h, p) ≤ (C + 1) · opt_k(p) + ε, as desired.

4 Lower bounds on agnostic learning

In this section we establish that α-agnostic learning with α < 2 is information-theoretically impossible, thus establishing Theorem 2.

Fix any 0 < t < 1/2. We define a probability distribution D_t over a finite set of discrete distributions over the domain [2N] = {1, . . . , 2N} as follows. (We assume without loss of generality below that t is rational and that tN is an integer.) A draw of p_{S₁,S₂,t} from D_t is obtained as follows.

1. A set S₁ ⊂ [N] is chosen uniformly at random from all subsets of [N] that contain precisely tN elements. For i ∈ [N], the distribution p_{S₁,S₂,t} assigns probability weight as follows:

   p_{S₁,S₂,t}(i) = (1/(2N)) · (1 + (1 − t)/(2t))   if i ∈ S₁,
   p_{S₁,S₂,t}(i) = (1/(2N)) · (1 − t/(2(1 − t)))   if i ∈ [N] \ S₁.

2. A set S₂ ⊂ [N + 1, . . . , 2N] is chosen uniformly at random from all subsets of [N + 1, . . . , 2N] that contain precisely tN elements. For i ∈ [N + 1, . . . , 2N], the distribution p_{S₁,S₂,t} assigns probability weight as follows:

   p_{S₁,S₂,t}(i) = 3/(4N)   if i ∈ S₂,
   p_{S₁,S₂,t}(i) = 1/(4N)   if i ∈ [N + 1, . . . , 2N] \ S₂.

Using a birthday-paradox-type argument, we show that no o(√N)-sample algorithm can successfully distinguish between a distribution p_{S₁,S₂,t} ∼ D_t and the uniform distribution over [2N]. We then leverage this indistinguishability to show that any (2 − δ)-semi-agnostic learning algorithm, even for 2-flat distributions, must use a sample of size Ω(√N) (see Appendix B for these proofs):

Theorem 7. Fix any δ > 0 and any function f(·). There is no algorithm A with the following property: given ε > 0 and access to independent points drawn from an unknown distribution p over [2N], algorithm A makes o(√N) · f(ε) draws from p and with probability at least 51/100 outputs a hypothesis distribution h over [2N] satisfying ‖h − p‖₁ ≤ (2 − δ)opt₂(p) + ε.

As described in the Introduction, via the obvious correspondence that maps distributions over [N] to distributions over [0, 1), we get the following:

Corollary 4.1. Fix any δ > 0 and any function f(·). There is no algorithm A with the following property: given ε > 0 and access to independent draws from an unknown distribution p over [0, 1), algorithm A makes f(ε) draws from p and with probability at least 51/100 outputs a hypothesis distribution h over [0, 1) satisfying ‖h − p‖₁ ≤ (2 − δ)opt₂(p) + ε.

² Note that now there is no guarantee that opt_C(p) ≤ ε; indeed, the point here is that opt_C(p) may be arbitrary.

References

[AJOS14] J. Acharya, A. Jafarpour, A. Orlitsky, and A.T. Suresh. Near-optimal-sample estimators for spherical Gaussian mixtures. Technical Report http://arxiv.org/abs/1402.4746, 19 Feb 2014.

[BBBB72] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk.
Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[Bir87] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.

[Bir97] L. Birgé. Estimation of unimodal densities without smoothness assumptions. Annals of Statistics, 25(3):970–981, 1997.

[Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. Ann. Math. Statist., 29(2):437–454, 1958.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013.

[CDSS14] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. Technical Report http://arxiv.org/abs/1305.3207; conference version in STOC, pages 604–613, 2014.

[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD Conference, pages 436–447, 1998.

[DDS12] A. De, I. Diakonikolas, and R. Servedio. Inverse problems in approximate uniform generation. Available at http://arxiv.org/pdf/1211.1722v1.pdf, 2012.

[DG85] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.

[DK14] C. Daskalakis and G. Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In COLT, pages 1183–1213, 2014.

[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.

[GGI+02] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, pages 389–398, 2002.

[GKS06] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst., 31(1):396–438, 2006.

[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.

[Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.

[HP76] D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics, 4(6):1038–1050, 1976.

[ILR12] P. Indyk, R. Levi, and R. Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In PODS, pages 15–22, 2012.

[JKM+98] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages 275–286, 1998.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proc. 26th STOC, pages 273–282, 1994.

[MP07] P. Massart. Concentration Inequalities and Model Selection. Ecole d'Eté de Probabilités de Saint-Flour XXXIII – 2003 (J. Picard, ed.), Lecture Notes in Mathematics, Springer, 2007.

[Pea95] K. Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Trans. of the Royal Society of London, 186:343–414, 1895.

[Rao69] B.L.S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.

[Reb05] L. Reboul. Estimation of a function under shape restrictions. Applications to reliability. Ann. Statist., 33(3):1330–1356, 2005.

[Sco92] D.W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York, 1992.

[Sil86] B. W. Silverman. Density Estimation. Chapman and Hall, London, 1986.

[Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.

[Weg70] E.J. Wegman. Maximum likelihood estimation of a unimodal density. I and II. Ann. Math. Statist., 41:457–471 and 2169–2174, 1970.