In topological data analysis, a point cloud data $P$ extracted from a metric space is often analyzed by computing the persistence diagram or barcodes of a sequence of Rips complexes built on $P$ indexed by a scale parameter. Unfortunately, even for input of moderate size, the size of the Rips complex may become prohibitively large as the scale parameter increases. Starting with the \emph{Sparse Rips filtration} introduced by Sheehy, some existing methods aim to reduce the size of the complex so as to improve the time efficiency as well. However, as we demonstrate, existing approaches still fall short of scaling well, especially for high dimensional data. In this paper, we investigate the advantages and limitations of existing approaches. Based on insights gained from the experiments, we propose an efficient new algorithm, called \emph{SimBa}, for approximating the persistent homology of Rips filtrations with quality guarantees. Our new algorithm leverages a batch collapse strategy as well as a new sparse Rips-like filtration. We experiment on a variety of low and high dimensional data sets. We show that our strategy presents a significant size reduction, and our algorithm for approximating Rips filtration persistence is order of magnitude faster than existing methods in practice.

Although many fast methods exist for constructing kNN-graph for low dimensional data, it is still an open question how to do it efficiently for high dimensional data. We present a new method to construct an approximate kNN-graph for medium to high dimensional data. Our method uses one dimensional mapping with Z-order curve to construct an initial graph and then continues to improve this using neighborhood propagation. Experiments show that the method is faster than the compared methods with four different benchmark data sets, the dimensionality of which ranges from 14 to 544. We also show that errors in the approximate kNN-graph originate more likely from outlier points; and it can be detected during run time which points are likely to have errors in their neighbors.

We present KADABRA, a new algorithm to approximate betweenness centrality in directed and undirected graphs, which significantly outperforms all previous approaches on real-world complex networks. The efficiency of the new algorithm relies on two new theoretical contribution, of independent interest. The first contribution focuses on sampling shortest paths, a subroutine used by most algorithms that approximate betweenness centrality. We show that, on realistic random graph models, we can perform this task in time $|E|^{\frac{1}{2}+o(1)}$ with high probability, obtaining a significant speedup with respect to the $\Theta(|E|)$ worst-case performance. We experimentally show that this new technique achieves similar speedups on real-world complex networks, as well. The second contribution is a new rigorous application of the adaptive sampling technique. This approach decreases the total number of shortest paths that need to be sampled to compute all betweenness centralities with a given absolute error, and it also handles more general problems, such as computing the $k$ most central nodes. Furthermore, our analysis is general, and it might be extended to other settings.

It is well-known that Quicksort - which is commonly considered as one of the fastest in-place sorting algorithms - suffers in an essential way from branch mispredictions. We present a novel approach to address this problem by partially decoupling control from data flow: in order to perform the partitioning, we split the input in blocks of constant size; then, all elements in one block are compared with the pivot and the outcomes of the comparisons are stored in a buffer. In a second pass, the respective elements are rearranged. By doing so, we avoid conditional branches based on outcomes of comparisons (except for the final Insertionsort). Moreover, we prove that when sorting $n$ elements the average total number of branch mispredictions is at most $\epsilon n \log n + \Oh(n)$ for some small $\epsilon$ depending on the block size. Our experimental results are promising: when sorting random integer data, we achieve an increase in speed (number of elements sorted per second) of more than 80% over the GCC implementation of Quicksort (C++ std::sort). Also for many other data types and non-random inputs, there is still a significant speedup. Only in few special cases like (almost) sorted inputs, std::sort can beat our implementation. Moreover, on random input permutations, our implementation is even slightly faster than Super Scalar Sample Sort, which uses a linear amount of additional space. Finally, we also apply our approach to Quickselect and we obtain a speed-up of more than 100% over the GCC implementation (C++ std::nth_element).