Local and Distributed Dynamics for Community Detection

Luca Becchetti

Given an underlying graph, we consider distributed dynamics that globally perform detection of planted communities. In the local communication model, we consider the following dynamics: Initially, each node locally chooses a value in $\{-1,1\}$, uniformly at random and independently of other nodes. Then, in each consecutive round, every node updates its local value to the average of the values held by its neighbors, at the same time applying an elementary, local clustering rule that only depends on the current and the previous values held by the node.  
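
As an illustration, here is a minimal simulation of this dynamics. The clustering rule used below, labeling each node by the sign of the change of its value, is one natural rule that depends only on the current and previous values; it is illustrative and not necessarily the exact rule analyzed in the talk.

```python
import random

def averaging_dynamics(adj, rounds=30, seed=0):
    """Simulate the averaging dynamics on a graph given as an adjacency list.

    Each node starts with a uniform random value in {-1, +1}; in every round
    it replaces its value by the average of its neighbors' values.  The
    clustering rule used here -- label a node by the sign of the change of
    its value -- is one natural rule depending only on the current and
    previous values (illustrative; not necessarily the talk's exact rule).
    """
    rng = random.Random(seed)
    x = {v: rng.choice([-1.0, 1.0]) for v in adj}
    labels = {v: 1 for v in adj}
    for _ in range(rounds):
        new_x = {v: sum(x[u] for u in adj[v]) / len(adj[v]) for v in adj}
        labels = {v: 1 if new_x[v] >= x[v] else -1 for v in adj}
        x = new_x
    return labels

# Two dense triangles joined by a single edge: a sparse balanced cut.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = averaging_dynamics(adj)
```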

We prove that the process resulting from this dynamics produces a clustering that exactly or approximately (depending on the graph) reflects the underlying cut in logarithmic time, under various graph models that exhibit a sparse balanced cut, including the stochastic block model.    

We also prove that a natural extension of this dynamics performs community detection on a regularized version of the stochastic block model with multiple communities.

We further discuss more recent extensions of this work to different models of communication (e.g., population protocols) and to the case of more than two communities.

Locality in Data Structures

Ioana Bercea

We will talk about ways in which locality is employed in the design of data structures. We will focus on dictionaries, which maintain sets under insertions and deletions and support membership queries of the form “is an element x in the set?”. We will discuss how state-of-the-art techniques for dictionaries exploit locality in the word RAM model to obtain space and time gains and also how locality is exploited in the choice of hash functions employed for the design.
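
To make the setting concrete, here is a toy dictionary in which all keys of a bucket are stored contiguously, so each operation touches a single small memory region. The word-RAM designs covered in the talk exploit this kind of locality with far more compact bucket representations; a Python list per bucket only illustrates the access pattern.

```python
class BucketDictionary:
    """Toy dictionary with the locality property made explicit: each key
    hashes to one bucket, and all keys of a bucket are stored contiguously,
    so insert/delete/membership each inspect a single small memory region.
    (The word-RAM designs in the talk pack each bucket into a few machine
    words; this is only an illustration of the access pattern.)"""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        b = self._bucket(key)
        if key not in b:
            b.append(key)

    def delete(self, key):
        b = self._bucket(key)
        if key in b:
            b.remove(key)

    def member(self, key):                     # "is an element x in the set?"
        return key in self._bucket(key)
```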

Multi-level Locality-sensitive hashing for sub-quadratic time DBSCAN clustering

Camilla Birch Okkels

DBSCAN is a density-based clustering algorithm used for finding clusters, detecting outliers, and in many other applications. Two user parameters, minPts and ε, define the density threshold: a point is a core point if at least minPts points lie within distance ε of it. Clusters are then formed by chains of density-reachable core points. Finding a clustering in high-dimensional data scales quadratically with the size of the data set.
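
A textbook rendition of these definitions in Python makes the bottleneck explicit: the all-pairs neighbor computation is the quadratic cost that the LSH-based data structure in this talk aims to avoid.

```python
def dbscan(points, eps, min_pts, dist):
    """Naive DBSCAN, written directly from the definitions above."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]              # O(n^2) distance computations
    labels = [None] * n                          # None = unvisited, -1 = noise
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:          # not a core point
            labels[i] = -1
            continue
        labels[i] = cluster                      # start a cluster at a core point
        stack = list(neighbors[i])
        while stack:                             # follow density-reachable chains
            j = stack.pop()
            if labels[j] == -1:                  # noise becomes a border point
                labels[j] = cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:     # j is core: keep expanding
                stack.extend(neighbors[j])
        cluster += 1
    return labels
```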

This talk will outline an algorithm that takes advantage of the partitioning of locality-sensitive hashing (LSH) in a multi-level LSH-based data structure. We will show how this data structure can significantly reduce the number of distance calculations required in DBSCAN, thereby improving its overall running time. Our objective is to prove a sub-quadratic running time when specific conditions exist in the data set.

Additionally, we will compare our algorithm with other LSH-based approaches from the literature, and show that, both in theory and in practice, it can be used efficiently for very large, high-dimensional data sets, in contrast to the traditional DBSCAN algorithm.

From Local to Global: A Beyond Worst-Case Analysis of the Louvain Algorithm for Graph Clustering

Vincent Cohen-Addad

A classic problem in data analysis consists in partitioning the vertices of a network in such a way that vertices in the same set are densely connected. In practice, the most popular approaches rely on local search algorithms, not only for their ease of implementation and efficiency, but also because of the accuracy of these methods on many real-world graphs. For example, the Louvain algorithm -- a local search based algorithm -- has quickly become the method of choice for detecting communities in social networks, accumulating more than 8600 citations over the past 10 years. However, explaining the success of these methods remains an open problem: in the worst case, their performance can be arbitrarily bad.
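
For concreteness, the local-move step at the heart of Louvain can be sketched as follows. This version recomputes modularity from scratch for each candidate move, which is slow but transparent; the actual algorithm uses constant-time gain formulas and adds a graph-coarsening phase on top of this step.

```python
def modularity(adj, comm):
    """Newman modularity of the partition `comm` (node -> community id)."""
    m2 = sum(len(ns) for ns in adj.values())              # 2m = sum of degrees
    deg = {v: len(ns) for v, ns in adj.items()}
    q = 0.0
    for v in adj:                                         # sum over ordered pairs
        for u in adj:
            if comm[v] == comm[u]:
                a = 1.0 if u in adj[v] else 0.0
                q += a - deg[v] * deg[u] / m2
    return q / m2

def local_move_pass(adj, comm):
    """One Louvain-style local-move pass: every node greedily adopts the
    neighboring community that most improves modularity."""
    for v in adj:
        best, best_q = comm[v], modularity(adj, comm)
        for c in {comm[u] for u in adj[v]}:
            comm[v] = c
            q = modularity(adj, comm)
            if q > best_q:
                best, best_q = c, q
        comm[v] = best

# Two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
comm = {v: v for v in adj}                                # singleton start
for _ in range(3):
    local_move_pass(adj, comm)
```

On this toy graph, a few passes suffice for the two triangles to emerge as the two communities.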

In this work, we study these local search heuristics and aim at explaining their success and identifying the mechanisms that make them successful through a classic model for social networks, the stochastic block model. We give the first theoretical evidence that Louvain finds the underlying natural partition of the network.  

Joint work with Adrian Kosowski, Frederik Mallmann-Trenn and David Saulpic.

Streaming Facility Location in High Dimension

Artur Czumaj

In this talk we consider Euclidean Uniform Facility Location in the classical setting of dynamic geometric streams, and we focus on the high-dimensional regime, where the algorithm’s space complexity must be polynomial (and certainly not exponential) in d log(n).

We present a new algorithmic framework, based on importance sampling from the stream, for O(1)-approximation of the optimal cost using only poly(d · log n) space. This framework is easy to implement in two passes and over random-order streams. We show how to apply our framework to arbitrary-order streams, where we compute an O(d/log d)-approximation in one pass by using the new framework but combining the two passes differently. This improves upon previous algorithms (with space polynomial in d), which guarantee only an O(d·log^2 n)-approximation, and our algorithms are therefore the first to avoid the O(log n) factor in the approximation that is inherent to the widely used quadtree decomposition. Our improvement is achieved by developing a novel geometric hashing scheme that maps points in R^d into buckets of bounded diameter, with the key property that every point set of small enough diameter is hashed into only a few buckets.
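
For intuition, the standard baseline that such a scheme improves upon is a randomly shifted uniform grid: every bucket has bounded diameter, but a point set of tiny diameter can still straddle up to 2^d cells, which is precisely the weakness the new hashing scheme addresses. A minimal sketch of the baseline:

```python
import random

def shifted_grid_hash(point, cell_width, shift):
    """Bucket id of `point` in a randomly shifted uniform grid.

    Each bucket (cell) has diameter at most cell_width * sqrt(d), but a set
    of tiny diameter may still straddle up to 2^d cells -- the weakness the
    talk's geometric hashing scheme is designed to overcome.
    """
    return tuple(int((x + s) // cell_width) for x, s in zip(point, shift))

d, w = 3, 1.0
rng = random.Random(42)
shift = [rng.uniform(0, w) for _ in range(d)]
b1 = shifted_grid_hash([0.05, 0.05, 0.05], w, shift)
b2 = shifted_grid_hash([0.06, 0.05, 0.05], w, shift)
```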

Joint work with Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Vesely, and Mingwei Yang

Higher degree sum-of-squares relaxations robust against oblivious outliers

Tommaso D'Orsi

We consider estimation models of the form $Y=X^*+N$, where $X^*$ is  some $m$-dimensional signal we wish to recover, and $N$  is symmetrically distributed noise that may be unbounded in all but a small $\alpha$ fraction of the entries. We introduce a family of algorithms that under mild assumptions recover the signal $X^*$ in all estimation problems for which there exists a sum-of-squares algorithm that succeeds in recovering the signal $X^*$ when the noise $N$ is Gaussian. This essentially shows that it is enough to design a sum-of-squares algorithm for an estimation problem with Gaussian noise in order to obtain an algorithm that works with the symmetric noise model. Our framework extends far beyond previous results on symmetric noise models and is even robust to adversarial perturbations.

In our proofs we use bounds on the covering numbers of sets of pseudo-expectations, which we obtain by certifying in sum-of-squares upper bounds on the Gaussian complexities of sets of solutions.

Privacy in clustering

Alessandro Epasto

Clustering is a fundamental unsupervised machine learning problem that lies at the core of several real-world applications. While traditional clustering algorithms have not considered the privacy of the users providing the data, recently private clustering has received significant attention. In this talk I will cover recent research in clustering with differential privacy, a strong notion of privacy guarantee promising plausible deniability for user data. I will mostly cover work on clustering graph data, and later briefly discuss work on clustering in metric spaces. For graph clustering, I will focus on our recent work (ICML 2023) where we show edge-differentially private hierarchical clustering algorithms with provable approximation guarantees.

The reconciliation k-median problem

Aristides Gionis

In the reconciliation k-median problem we ask to cluster a set of data points by picking k cluster centers so as to minimize the sum of distances of the data points to their cluster centers plus the sum of pairwise distances between the centers. The problem, which is a variant of classic k-median, aims to find a set of cluster centers that are not too far from each other, and it has applications, for example, when selecting a committee to deliberate on a controversial topic. In this talk, we demonstrate a close connection of reconciliation k-median to a variant of the k-facility location problem, in which each potential cluster center has an individual opening cost and we aim at minimizing the sum of client-center distances and the opening costs. This connection enables us to provide a new algorithm for reconciliation k-median that yields a constant-factor approximation. We also discuss a sparsification scheme that reduces the number of potential cluster centers to O(k) in order to substantially speed up approximation algorithms.
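
The objective can be stated in a few lines of Python; this is a direct transcription of the definition above, not the approximation algorithm itself.

```python
def reconciliation_cost(points, centers, dist):
    """Reconciliation k-median objective: assignment cost (each point to its
    closest center) plus the sum of pairwise distances between the centers,
    which penalizes center sets that are spread far apart."""
    assignment = sum(min(dist(p, c) for c in centers) for p in points)
    spread = sum(dist(centers[i], centers[j])
                 for i in range(len(centers))
                 for j in range(i + 1, len(centers)))
    return assignment + spread
```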

Dynamic k-center through dynamic approximate furthest neighbor

Mordecai Golin

For a set of points P, the k-center problem is to find a minimum radius r and an associated center set C of k points such that the distance from each point in P to its closest center is at most r. While NP-hard to solve, there exist easy 2-approximation algorithms for general metrics and approximation algorithms for the Euclidean version. Many of these approximation algorithms use furthest neighbor queries as subroutines. Given a second point set C’, a furthest neighbor query finds a point p in P that is a furthest neighbor from C’.
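
As an example of such a subroutine, the classic greedy 2-approximation for k-center (Gonzalez's algorithm) is driven entirely by furthest neighbor queries; swapping in approximate queries, as in this talk, is exactly the plug-in step described below.

```python
def furthest_neighbor(P, C, dist):
    """Furthest neighbor query: the point of P furthest from the set C."""
    return max(P, key=lambda p: min(dist(p, c) for c in C))

def gonzalez_k_center(P, k, dist):
    """Gonzalez's greedy 2-approximation for k-center: repeatedly add the
    point furthest from the current centers.  Replacing the exact query with
    an approximate furthest neighbor query yields the dynamic algorithms
    discussed in the talk."""
    C = [P[0]]
    while len(C) < k:
        C.append(furthest_neighbor(P, C, dist))
    radius = max(min(dist(p, c) for c in C) for p in P)
    return C, radius
```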

The dynamic version of the k-center problem maintains P over insertions and deletions of points, in a way that permits efficiently solving the approximate k-center problem for the current P. This has been extensively studied. By contrast, the related problem of maintaining P to permit efficiently solving approximate furthest neighbor queries does not seem to have been previously studied.

We show that, for points in bounded doubling dimension, the approximate furthest neighbor problem can be efficiently solved using navigating nets. Plugging this as a subroutine into known static k-center approximation algorithms (replacing their furthest neighbor queries with approximate furthest neighbor ones) yields good new approximation algorithms for dynamic k-center. Unlike some of the older algorithms, this new approach does not require knowing k or ε in advance. This new approach also yields what seems to be the first non-randomized algorithm for dynamic Euclidean k-center.

Clustering with same-cluster queries

Silvio Lattanzi

In this talk we focus on the semi-supervised active clustering problem where we study the cluster recovery problem. Given a finite set of input points, and an oracle revealing whether any two points lie in the same cluster, our goal is to recover all clusters exactly using as few queries as possible. To this end, we first relax the spherical $k$-means cluster assumption of Ashtiani et al. to allow for arbitrary ellipsoidal clusters with margin. We show that, even in this much more general setting, it is still possible to recover the latent clustering exactly using a number of queries that scales only logarithmically with the number of input points. Finally, we also consider the more challenging non-convex setting, for which we provide additional results.
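
As a baseline for comparison, exact recovery is always possible with O(nk) queries by testing each point against one representative per discovered cluster; the point of the talk is that, under the margin assumptions, one can do exponentially better in the number of points. A sketch of the baseline:

```python
def recover_clusters(points, same_cluster):
    """Exact cluster recovery with a same-cluster oracle: keep one
    representative per discovered cluster and query each new point against
    the representatives.  This baseline spends O(n * k) oracle queries."""
    reps, clusters = [], []
    for p in points:
        for i, r in enumerate(reps):
            if same_cluster(p, r):
                clusters[i].append(p)
                break
        else:                                   # no representative matched
            reps.append(p)
            clusters.append([p])
    return clusters

# Toy oracle: two latent clusters, odd and even integers.
groups = recover_clusters([1, 2, 3, 4, 5], lambda a, b: a % 2 == b % 2)
```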

Coreset strategies for dynamic k-center in doubling metrics

Paolo Pellizzoni

The k-center problem is a commonly used clustering task where, given a set of points from a metric space, one seeks to select, from the given pointset, k centers such that the maximum distance from any point to its closest center is minimized. In many practical applications, the pointset is often dynamic, with new points being added and some becoming stale, and developing efficient algorithms for dealing with such dynamic pointsets is thus of paramount importance. 

We present various algorithms for k-center clustering and some closely related problems in two dynamic frameworks, namely the sliding window and the fully dynamic one. Our algorithms leverage the properties of metric spaces with bounded doubling dimension to maintain, at any point in time, a coreset (i.e., a small subset) of the set of active points, from which a provably good approximate solution can be efficiently extracted.

Discovering Temporal Motif Densest Subnetworks

Ilie Sarpe

A temporal network is a powerful network model enabling the representation of fine-grained temporal information over the edges of a network. Analysing temporal networks through the lens of novel data-mining primitives is an important task to better understand such complicated systems, but it is algorithmically challenging. One of the most important primitives for temporal network analysis is the study of temporal motifs, i.e., small subnetworks occurring in a short amount of time with a prescribed ordering over the various edges of the motif.  

In this talk we will introduce a novel problem requiring the computation of the densest subnetwork of a large temporal network, where density is defined with respect to the temporal motif counts in each subnetwork. We will then present novel scalable and efficient algorithms providing rigorous guarantees on their output to tackle this novel problem, and show how these algorithms perform in practice.

Using Quadtree for Clustering

David Saulpic

The quadtree decomposition of Arora is one of the staples of approximation algorithms in Euclidean space. It is the main tool for several fundamental results, such as an optimal approximation scheme for low-dimensional TSP.

For k-median and k-means clustering, this technique has been used in various settings, e.g., differential privacy, streaming, or the design of fast algorithms. There is, however, one big caveat: while a quadtree decomposition preserves distances in expectation, it does not preserve the square of distances -- which is needed for k-means!

In this presentation, I will focus on low-dimensional Euclidean space, the most basic setting in which to study quadtrees. I will present some recent developments that allow this technique to be used even for squared distances, and almost optimally: more precisely, I will show how to compute a (1+eps)-approximation to k-means in running time 2^{\tilde O(1/eps^{d-1})} n, almost matching a Gap-ETH lower bound.
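
A minimal sketch of the basic object: a randomly shifted quadtree assigns each point a cell id per level, with the cell side length halving at every level. The random shift is what makes the decomposition preserve distances in expectation.

```python
def quadtree_cell(point, level, shift, root_width):
    """Cell id of `point` at depth `level` of a randomly shifted quadtree
    (level 0 is the root box; each level halves the cell side length).
    The random shift makes the decomposition preserve distances in
    expectation: close points are separated only at fine levels, and only
    with small probability."""
    w = root_width / (2 ** level)
    return tuple(int((x + s) // w) for x, s in zip(point, shift))

shift = (0.3, 0.7)        # in a real construction, drawn uniformly at random
p1, p2 = (1.0, 1.0), (1.5, 1.5)
```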

On the adversarial robustness of Locality-Sensitive Hashing in  Hamming space

Christian Sohler

Locality-sensitive hashing is a classical randomized data structure for approximate nearest neighbor search. The provided probabilistic guarantees are usually against an oblivious adversary. However, in many applications of nearest neighbor search the queries are chosen adaptively.

We study the robustness of locality-sensitive hashing to adaptive queries in Hamming space. We present a simple adversary that, under mild assumptions on the initial point set, can provably find a query on which the approximate near neighbor data structure fails, significantly faster than with non-adaptive queries.
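
For concreteness, the classical LSH family for Hamming space is bit sampling, where a hash function reads a random subset of coordinates; this is the kind of randomized structure whose internal randomness an adaptive adversary can probe. A sketch:

```python
import random

def bit_sampling_lsh(num_bits, dim, rng):
    """Classic bit-sampling LSH for Hamming space: a hash function reads a
    random multiset of coordinates, so two points collide iff they agree on
    every sampled bit.  The collision probability, (1 - dist/dim)^num_bits,
    decreases with the Hamming distance -- the locality-sensitivity property
    that an adaptive adversary tries to defeat."""
    coords = [rng.randrange(dim) for _ in range(num_bits)]
    return lambda x: tuple(x[i] for i in coords)

h = bit_sampling_lsh(4, 16, random.Random(0))
```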

Fully Dynamic k-Center Clustering with Outliers

Mauro Sozio

In this talk, we consider the robust version of the classic k-center clustering problem in general metric spaces: we wish to remove up to z points (outliers), so as to be able to cluster the remaining points in k clusters with minimum maximum radius. We study such a problem under the fully dynamic adversarial model, where points can be inserted or deleted arbitrarily. In this setting, the main goal is to design algorithms that maintain a high quality solution at any point in time, while requiring a “small” amortized cost, i.e. a “small” number of operations per insertion or deletion, on average. Our work provides the first constant bi-criteria approximation algorithm for such a problem with its amortized cost being independent of both z and the size of the current input. We also complement our positive result with a lower bound showing that any constant (non bi-criteria) approximation algorithm has amortized cost at least linear in z. Finally, we discuss an in-depth experimental analysis of our algorithm on Twitter, Flickr, and Air-Quality datasets showing the effectiveness of our approach.

Optimal Coresets for Euclidean k-Means

Chris Schwiegelshohn

Coresets are a type of summary that allows us to estimate the cost of any candidate query. If the queries happen to be sets of k centers and the costs are the squared Euclidean distances between points and their closest centers, we are studying coresets for Euclidean k-means, arguably the most important and most widely researched coreset question. Following a series of works, we now know that a coreset of size $O(k/\varepsilon^2 \cdot \min(\sqrt{k}, 1/\varepsilon^2))$ exists, and this has recently been shown to be optimal. In this talk we will outline how the construction works and what the main insights are towards proving this result.
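
To fix notation: a coreset here is a small weighted point set whose weighted cost agrees with the full data set's cost, up to a 1 ± ε factor, simultaneously for every candidate set of k centers. Evaluating that weighted cost is a direct transcription of the definition:

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum over points of weight times the squared
    Euclidean distance to the closest center.  A coreset must approximate
    the unweighted cost of the full set for every candidate center set."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
               for p, w in zip(points, weights))

full = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
coreset, cw = [(0.0, 0.0), (2.0, 0.0)], [2.0, 1.0]   # exact here: merged duplicates
```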

Scalable Approximation of Internal Measures for Clustering Evaluation

Fabio Vandin

An important step in cluster analysis is the evaluation of the quality of a given clustering through structural measures of goodness. Measures that require no information beyond the clustering itself, called internal measures, are commonly used because of their generality. The most widely used internal measure is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this work, we present scalable algorithms to compute such a rigorous approximation for the evaluation of clusterings based on any metric distance.
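
A direct transcription of the silhouette definition makes the quadratic cost explicit; this naive baseline is what the scalable approximation algorithms improve upon.

```python
def silhouette(points, labels, dist):
    """Naive silhouette coefficient: for each point, a = mean distance to
    its own cluster, b = smallest mean distance to another cluster, and the
    score is (b - a) / max(a, b); the coefficient is the mean score.  The
    nested distance sums make this quadratic in the number of points."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    scores = []
    for i in range(len(points)):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)                  # common singleton convention
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in c) / len(c)
                for l, c in clusters.items() if l != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```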

Robustness and dynamicity in submodular maximization

Morteza Zadimoghaddam

Submodular maximization has been well studied as an algorithmic approach in solving exemplar-based clustering problems. We study this optimization problem with different feasibility constraints in dynamic settings where elements can be inserted or deleted arbitrarily. The challenge is to maintain data structures that require minimal computational resources per update. We start by designing robust algorithmic ideas in the presence of an adversary that deletes a portion of the dataset in batch mode.
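
As background, the static starting point is the classic greedy algorithm for monotone submodular maximization under a cardinality constraint, shown here on a toy coverage objective; the dynamic and robust algorithms in the talk maintain solutions of this kind under updates.

```python
def greedy_submodular(ground, k, f):
    """Classic greedy for monotone submodular maximization subject to a
    cardinality constraint (a (1 - 1/e)-approximation): repeatedly add the
    element with the largest marginal gain."""
    S = []
    for _ in range(k):
        best = max((x for x in ground if x not in S),
                   key=lambda x: f(S + [x]) - f(S))
        S.append(best)
    return S

# Toy exemplar selection as coverage maximization (a submodular objective).
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}
cover = lambda S: len(set().union(*(sets[x] for x in S))) if S else 0
chosen = greedy_submodular(list(sets), 2, cover)
```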