Sketching-based Matrix Computations for Machine Learning

Making big data small

Download .zip Download .tar.gz View on GitHub
Doxygen Sphinx (stable) Sphinx (dev)

libSkylark

This library for Sketching-based Matrix Computations for Machine Learning, known informally as libSkylark, is suitable for general statistical data analysis and optimization applications.

Many tasks in machine learning and statistics ultimately end up being problems involving matrices: whether you're finding the key players in the bitcoin market, or inferring where tweets came from, or figuring out what's in sewage, you'll want to have a toolkit for least-squares and robust regression, eigenvector analysis, non-negative matrix factorization, and other matrix computations.

Sketching is a way to compress matrices that preserves key matrix properties; it can be used to speed up many matrix computations. Sketching takes a given matrix A and produces a sketch matrix B that has fewer rows and/or columns than A. For a good sketch B, if we solve a problem with input B, the solution will also be pretty good for input A. For some problems, sketches can also be used to get faster ways to find high-precision solutions to the original problem. In other cases, sketches can be used to summarize the data by identifying the most important rows or columns.

A simple example of sketching is just sampling the rows (and/or columns) of the matrix, where each row (and/or column) is equally likely to be sampled. This uniform sampling is quick and easy, but doesn't always yield good sketches; however, there are sophisticated sampling methods that do yield good sketches.

The goal of this project is to build a sketching-based open-source software stack for NLA and its applications, as shown:

Matrix Completion Nonlinear RLS,
SVM, PCA
Robust Regression Other applications
Python: Python-based data analytics scripting layer
PythonBinding: C++ to Python bindings
NLA: Numerical Linear Algebra primitives
(Least squares regression, low-rank approximation, randomized estimators)
Sketch: Sketching kernels
JL, FJL, Gaussian, Sign, Sparse Embedding
Third-Party Libraries:
MPI, Elemental, BLAS, CombBLAS, FFTW, Boost

Please note that the code is in "beta" mode. We are constantly testing and hardening the code, and improving the documentation.

Quick Install Guide

Vagrant and VirtualBox provide an easy way to get started with libSkylark. Follow the steps below to get a libSkylark environment running on a virtual machine. The following assumes Vagrant and VirtualBox are installed.


        git clone https://github.com/xdata-skylark/libskylark.git
        cd libskylark/vagrant/precise64
        vagrant up
        vagrant ssh
        

Note that if you want to test the current development version use the recipe below.


        git clone -b development https://github.com/xdata-skylark/libskylark.git
        cd libskylark/vagrant/trusty64
        vagrant up
        vagrant ssh
        
More details are provided in the quick start section.

Building from source

libSkylark is relying on Cmake as a build system. Before you start please make sure to check out the Installing software dependencies for libSkylark section to learn about the required dependencies.

In many situations the default configuration and settings should work out of the box. To that end execute


        cd $BUILD_DIR
        CC=mpicc CXX=mpicxx cmake $SRC_DIR
        make
        make install
        

Note: If you have MPI compilers in your PATH environment variable, Cmake may overwrite the compiler specified in the CXX flag.

In case you have a more specific setup or this does not work on your machine continue reading the installation documentation.

Examples