Shape Fitting on Uncertain Data

Jeff M. Phillips (with Maarten Löffler, Utrecht University)

For many large data sets, there is an error model associated with each data point. For instance, in GIS data and robotic mapping data, each point might be modeled by a Gaussian distribution; in many public data sets used in data mining, certain attributes have been discretized or randomly perturbed to preserve anonymity. Despite this, most algorithms used on these data sets assume the error does not significantly affect the outcome, or, even worse, that the original data are precise. This can lead to erroneous results. Other models truncate the error distributions at a fixed threshold, treating them as shapes (e.g., balls) in which the data point could lie, and then perform combinatorial analysis on the set of shapes to derive upper and lower bounds on statistics (e.g., the radius of the smallest enclosing ball). This approach suffers from the arbitrary choice of a fixed threshold to bound the error distributions, and it yields only worst-case bounds on answers rather than their distributions.
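
To make the fixed-threshold approach concrete, here is a minimal sketch in Python. It assumes each true point lies within distance delta of its nominal center, and it approximates the smallest enclosing ball with the simple Badoiu-Clarkson iteration; the helper names (meb_radius, radius_bounds) are illustrative, not taken from any particular paper.

    import numpy as np

    def meb_radius(points, iters=1000):
        # Badoiu-Clarkson iteration: repeatedly step the candidate center
        # toward the current farthest point, with a shrinking step size.
        # This yields a (1+eps)-approximation to the minimum enclosing
        # ball after O(1/eps^2) steps.
        c = points[0].astype(float)
        for i in range(1, iters + 1):
            q = points[np.argmax(np.linalg.norm(points - c, axis=1))]
            c += (q - c) / (i + 1)
        return float(np.linalg.norm(points - c, axis=1).max())

    def radius_bounds(centers, delta):
        # If every true point lies within delta of its nominal center, the
        # true MEB radius lies in [r - delta, r + delta], where r is the
        # MEB radius of the centers (treated as exact here, even though
        # meb_radius only approximates it).
        r = meb_radius(np.asarray(centers, dtype=float))
        return max(0.0, r - delta), r + delta

These are exactly the kind of single-number upper and lower bounds criticized above: they depend entirely on the choice of delta.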

A better approach is to represent the answers themselves as probability distributions. For instance, for a set of points, each with error modeled by a probability distribution, the radius of the smallest enclosing ball should itself be modeled as a probability distribution. Under this model, we developed data structures that compactly represent approximate answers as probability distributions, along with simple and practical algorithms to construct them. More importantly, we bound the error of the approximation with respect to the error models of the data, not just the raw data or an ad hoc approximation of it. (See a preprint here.)
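
A naive way to see the target object is Monte Carlo: sample one realization of every point from its error model, compute the answer, and repeat. The sketch below does this for the smallest-enclosing-ball radius under Gaussian error. It illustrates only what the distributional answer looks like, not the algorithms or error guarantees developed in the paper; meb_radius is the same approximate helper as in the previous sketch.

    import numpy as np

    def meb_radius(points, iters=1000):
        # Same Badoiu-Clarkson approximation as in the sketch above.
        c = points[0].astype(float)
        for i in range(1, iters + 1):
            q = points[np.argmax(np.linalg.norm(points - c, axis=1))]
            c += (q - c) / (i + 1)
        return float(np.linalg.norm(points - c, axis=1).max())

    def radius_distribution(means, covs, n_samples=500, seed=0):
        # Point i is drawn from a Gaussian with mean means[i] and
        # covariance covs[i]; each sample of the full point set yields
        # one sample of the radius.
        rng = np.random.default_rng(seed)
        radii = []
        for _ in range(n_samples):
            pts = np.array([rng.multivariate_normal(m, C)
                            for m, C in zip(means, covs)])
            radii.append(meb_radius(pts))
        return np.sort(np.array(radii))  # sorted radii define the empirical CDF

The empirical CDF at a value x is then simply the fraction of returned radii that are at most x.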

Additionally, we reduce the size of the probability distribution representing the answer, without losing any accuracy, by improving on a classic data structure called an eps-approximation (related to eps-nets and range spaces). (See paper here.) In many reasonable settings, including the one used for representing the answers to the above problems, we reduce the size from O((1/eps^2) log(1/eps)), which is near-optimal in the general case, to roughly its square root: O((1/eps) log^c (1/eps)) for a small constant c.
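
To see why a small summary can represent such an answer distribution, consider the one-dimensional case, where the ranges of interest are one-sided intervals (queries of the form "what fraction of the distribution is at most x?"). There, keeping every ceil(eps*n)-th order statistic already gives an eps-approximation of size about 1/eps. The toy sketch below illustrates only that easy case; the result above concerns richer range spaces, where improving on the classic O((1/eps^2) log(1/eps)) bound is the hard part.

    import math

    def eps_approximation_1d(samples, eps):
        # Keep every step-th order statistic, with step = ceil(eps * n).
        # For any query value x, (number of kept points <= x) * step / n
        # estimates the true fraction of samples <= x with error less
        # than step / n <= eps + 1/n.
        s = sorted(samples)
        step = max(1, math.ceil(eps * len(s)))
        return s[step - 1::step]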