Abstract:
With the advent of big data, graphs have gained popularity as one of the most efficient data storage mechanisms. A graph can not only capture relationships between entities but also store attributes associated with those entities in the form of attributed nodes, which makes it a versatile data structure. Attributed network embedding is the task of representing each node of a graph as a low-dimensional vector that captures both its neighborhood associations and its attribute information. A downstream machine learning algorithm can use such an embedding to perform node classification, link prediction, and community detection. Several recently proposed learning-based methods produce high-utility embeddings, but they scale poorly in embedding space and embedding time with respect to network size, and they struggle on massive billion-scale networks. Our study addresses this problem by introducing BGENA (Binary-embedding GENerator for Attributed graphs), which combines BinSketch, a recently proposed fast and utility-preserving sketching method, with a novel edge propagation mechanism to generate a binary embedding of each node. BGENA is designed to preserve any arbitrary order of proximity of nodes within its embedding. Because the entire embedding process uses only fast bitwise operations, BGENA achieves a 10× to 100× speedup over some existing methods. BGENA's binary embeddings also allow efficient bit-array and sparse-matrix representations, reducing the system's memory requirement by a factor of four to eight. We also propose a parallelized version, PBGENA (Parallelized BGENA), which uses MPI to leverage the multi-core architecture of a system and further accelerates embedding to nearly 16× over BGENA.
PBGENA produced embeddings for all our graphs with 20,000 or fewer nodes in less than a second on an AMD 32-core 3.2 GHz server, and it embedded TWeibo, a graph with over 2 million nodes and 50 million edges, in less than two minutes. Further, BGENA is the only method known to us that was able to embed MAKG, a graph with nearly 60 million nodes and a billion edges, within the system's 270 GB memory cap, taking just 8 hours and reaching comparable accuracy. We evaluate PBGENA embeddings on node classification, link prediction, and graph visualization using several real-world networks of varied sizes; PBGENA outperforms the state-of-the-art baselines, often by large margins and in a fraction of the time. Our experiments show that some embedding methods rank among the best on particular graphs but underperform significantly on others. After hyperparameter tuning, no such effect was observed for PBGENA. Together, these properties make PBGENA a robust, high-utility, cost-effective, and space-efficient embedding method.
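
As a rough illustration of why binary embeddings lend themselves to bit-array storage and bitwise operations, the sketch below packs random binary vectors with NumPy. It is not taken from the BGENA implementation: the sizes, the variable names, and the baseline of one byte per bit are assumptions made purely for illustration, and the 8× ratio it shows is relative to that naive baseline only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_nodes, num_bits = 1000, 256

# One binary embedding per node, stored naively as one byte per bit.
dense = rng.integers(0, 2, size=(num_nodes, num_bits), dtype=np.uint8)

# Bit-array form: eight embedding bits packed into each byte.
packed = np.packbits(dense, axis=1)

# Bitwise OR of two packed rows; equivalent to OR-ing the dense rows
# and packing the result.
merged = np.bitwise_or(packed[0], packed[1])

# Hamming distance between two embeddings, computed on packed bytes.
hamming = int(np.unpackbits(np.bitwise_xor(packed[0], packed[1])).sum())

print(dense.nbytes, packed.nbytes)
```

Operating on the packed form keeps every step a byte-wise bitwise operation, which is the general property the abstract attributes the speed and memory savings to; the actual aggregation rule used by BGENA is not reproduced here.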