IO in libSkylark¶
IO from C++¶
Currently, IO from the C++ layer is implemented in libskylark/ml/io.hpp.
Reading¶
- void read(const boost::mpi::communicator& comm, int fileformat, std::string filename, InputType& X, LabelType& Y, int d=0)¶
Reads data from filename as a pair of matrices X and Y. The data consist of instances, each carrying a label; an instance in turn consists of a series of feature/value pairs. The feature/value pairs become the column indices/entries of matrix X, whose row indices come from a running index over the instances (d is the minimum number of features); matrix Y holds the vector of labels in the same running order. fileformat can be any of the following values of the FileFormatType enum (defined in libskylark/ml/options.hpp), as illustrated in the sketch after this list:
- HDF5_DENSE
- HDF5_SPARSE
- LIBSVM_DENSE
- LIBSVM_SPARSE
The *_DENSE values correspond to a dense El::Matrix<T> matrix X; the *_SPARSE values correspond to a sparse skylark::base::sparse_matrix_t<T> matrix X. The HDF5_* values are for Hierarchical Data Format, version 5 (HDF5) files, while the LIBSVM_* values are for Library for Support Vector Machines (LIBSVM) files.
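For instance, a minimal sketch of reading a LIBSVM file into a sparse feature matrix and a dense label vector could look like the following; the include path, any namespace qualification of read, and the filename train.libsvm are assumptions to be adjusted to your installation.
#include <boost/mpi.hpp>
#include <El.hpp>
#include "ml/io.hpp"  // adjust the include path to your build layout

int main(int argc, char* argv[]) {
    El::Initialize(argc, argv);
    {
        boost::mpi::communicator comm;
        skylark::base::sparse_matrix_t<double> X;  // feature matrix (sparse)
        El::Matrix<double> Y;                      // label vector (dense)
        // LIBSVM_SPARSE selects the sparse LIBSVM reader; d = 0 lets the
        // reader infer the feature dimension from the data.
        read(comm, LIBSVM_SPARSE, "train.libsvm", X, Y);
    }
    El::Finalize();
    return 0;
}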
HDF5-specific routines
- void read_hdf5(const boost::mpi::communicator& comm, std::string filename, El::Matrix<double>& X, El::Matrix<double>& Y, int blocksize=10000)¶
- void read_hdf5(const boost::mpi::communicator& comm, std::string filename, skylark::base::sparse_matrix_t<double>& X, El::Matrix<double>& Y, int min_d=0)¶
Reads the datasets named X and Y from the HDF5 file filename into the matrix pair X, Y. In the dense overload, instances are read in blocks of blocksize at a time; in the sparse overload, min_d is the minimum number of features.
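A minimal sketch of both overloads, assuming a communicator comm as in the earlier example and a placeholder filename train.h5:
El::Matrix<double> X, Y;
// Dense overload: pull instances from the HDF5 file in blocks of 5000
// instead of the default 10000.
read_hdf5(comm, "train.h5", X, Y, 5000);

skylark::base::sparse_matrix_t<double> Xs;
El::Matrix<double> Ys;
// Sparse overload: pad the feature dimension up to at least 1000.
read_hdf5(comm, "train.h5", Xs, Ys, 1000);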
LIBSVM-specific routines
- void read_libsvm(const boost::mpi::communicator& comm, std::string filename, El::Matrix<T>& X, El::Matrix<T>& Y, int min_d=0, int blocksize=10000)¶
- void read_libsvm(const boost::mpi::communicator& comm, std::string filename, skylark::base::sparse_matrix_t<T>& X, El::Matrix<T>& Y, int min_d=0)¶
Reads data from the LIBSVM file filename into the matrix pair X, Y. Instances are read in blocks of blocksize at a time (dense overload only); min_d is the minimum number of features.
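A minimal sketch along the same lines (comm and train.libsvm are placeholders, as above):
El::Matrix<double> X, Y;
// Dense overload: infer the feature dimension (min_d = 0) and read
// instances in blocks of 20000.
read_libsvm(comm, "train.libsvm", X, Y, 0, 20000);

skylark::base::sparse_matrix_t<double> Xs;
El::Matrix<double> Ys;
// Sparse overload: only min_d is available.
read_libsvm(comm, "train.libsvm", Xs, Ys, 0);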
Utility routines
- void read_hdf5_dataset(H5::H5File& file, std::string name, int* buffer, int offset, int count)¶
- void read_hdf5_dataset(H5::H5File& file, std::string name, double* buffer, int offset, int count)¶
Reads count entries from the dataset named name in the HDF5 file file, starting at offset, and collects them into buffer.
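For example, a sketch that pulls 100 doubles out of a dataset; the filename data.h5 and the dataset name values are placeholders.
#include <H5Cpp.h>
#include <vector>

H5::H5File file("data.h5", H5F_ACC_RDONLY);
std::vector<double> buffer(100);
// Read entries 50 .. 149 of the dataset "values" into buffer.
read_hdf5_dataset(file, "values", buffer.data(), 50, 100);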
- void read_model_file(std::string filename, El::Matrix<double>& W)¶
Reads the values from the model file filename into matrix W. The model file is expected to start with a line of the form # Dimensions m n (for an m x n matrix W), followed by lines containing the values of successive rows of the matrix.
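For example, a hypothetical model file model.txt for a 2 x 3 matrix W would contain
# Dimensions 2 3
0.5 1.0 -0.25
2.0 0.0 3.5
and could be loaded with:
El::Matrix<double> W;
read_model_file("model.txt", W);  // W is now the 2 x 3 matrix above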
- std::string read_header(const boost::mpi::communicator& comm, std::string filename)¶
Returns the first line of filename.
Writing¶
- int write_hdf5(const boost::mpi::communicator& comm, std::string filename, El::Matrix<double>& X, El::Matrix<double>& Y)¶
- int write_hdf5(std::string filename, skylark::base::sparse_matrix_t<double>& X, El::Matrix<double>& Y)¶
Writes the feature/value pairs (matrix X) and the labels (matrix Y) to the HDF5 file filename and returns 0 to signal successful completion; a round-trip sketch follows the list below.
- For a dense input matrix X (of El::Matrix<double> type), two datasets named X and Y are created in the file.
- For a sparse input matrix X (of skylark::base::sparse_matrix_t<double> type), five datasets are created in the file: dimensions (the size of X), indptr, indices, values (the sparse structure of X), and Y (the matrix Y).
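A minimal round-trip sketch for the dense overload (comm as above; the filename snapshot.h5 and the random/zero data are placeholders):
El::Matrix<double> X, Y;
El::Uniform(X, 100, 20);  // 100 instances with 20 features each
El::Zeros(Y, 100, 1);     // dummy labels
// Write the X and Y datasets; a return value of 0 signals success.
int rc = write_hdf5(comm, "snapshot.h5", X, Y);
// Read them back with the dense HDF5 reader described earlier.
El::Matrix<double> X2, Y2;
read_hdf5(comm, "snapshot.h5", X2, Y2);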
Note
Currently, IO in the C++ and Python layers is not in sync; we expect the offerings of these layers to converge as the libraries we depend on (most notably Elemental) freeze their IO facilities and typical/common application IO scenarios are clearly identified.
IO from Python¶
Example¶
#!/usr/bin/env python
# prevent mpi4py from calling MPI_Finalize()
import mpi4py.rc
mpi4py.rc.finalize = False
import h5py
import elem
import numpy
import numpy.linalg
from mpi4py import MPI
import skylark, skylark.io
comm = MPI.COMM_WORLD
# Create an HDF5 file and encapsulate it and its metadata in store
filename = 'mydataset.hdf5'
store = skylark.io.hdf5(filename, dataset='MyDataset')
# Create an 8 x 10 data matrix at root and write it to the store
if comm.Get_rank() == 0:
    m = 8
    n = 10
    matrix = numpy.array(range(1, 81)).reshape(m, n)
    store.write(matrix)
# Let all processes wait till the file is created.
comm.barrier()
# All processes read from the file
A = store.read('elemental-dense', distribution='VC_STAR')
# Check the Frobenius norm of the difference of generated/written and read-back matrices
# Gather at root
A_CIRC_CIRC = elem.DistMatrix_d_CIRC_CIRC()
elem.Copy(A, A_CIRC_CIRC)
# Compute the norm at root and output
if comm.rank == 0:
    diff_fro_norm = numpy.linalg.norm(A_CIRC_CIRC.Matrix[:] - matrix, ord='fro')
    print '||generated_matrix - read_matrix||_F = %f' % diff_fro_norm
Sketch Serialization¶
As mentioned in Section Sketching Transforms, sketches can be serialized using the following API:
- boost::property_tree::ptree to_ptree() const¶
Serialize the sketch transform to a ptree structure.
- static sketch_transform_t* from_ptree(const boost::property_tree::ptree& pt)¶
Load a sketch transform from a ptree structure.
The following small snippet shows how a sketch can be serialized and loaded in C++.
// 1. Create the sketching matrix and dump JSON
skylark::sketch::CWT_t<DistMatrixType, DistMatrixType>
    Sparse(n, n_s, context);
// dump to property tree
boost::property_tree::ptree pt = Sparse.get_data()->to_ptree();

// 2. Dump the JSON string to file
std::ofstream out("sketch.json");
boost::property_tree::write_json(out, pt);
out.close();

// 3. Create a sketch from the JSON file
std::ifstream file("sketch.json", std::ios::in);
boost::property_tree::ptree json_tree;
boost::property_tree::read_json(file, json_tree);
skylark::sketch::CWT_t<DistMatrixType, DistMatrixType> tmp(json_tree);
This functionality is also exposed to the Python layer in two ways. The first is a straightforward serialize method on sketch objects, with a corresponding deserialize_sketch function at the module level. The following is a short example.
S = sketch.CWT(10, 6)
sketch_dict = S.serialize()
S_clone = sketch.deserialize_sketch(sketch_dict)
Alternatively, sketch objects also support the pickle and cPickle modules. This allows them to be easily used by external libraries or software such as Spark.