Large-scale graph processing is one of many important parts of the Data Infrastructure backend services at Facebook. The need to analyze graphs arises naturally in a wide variety of use cases, including page and group recommendations [2], infrastructure optimization through intelligent data placement [3], graph compression [4], and others. The Facebook social graph alone has 1.71 billion vertices with hundreds of billions of edges. Augmenting this graph with more entities, such as the pages that people like, may result in graphs with over 1 trillion edges.
At Facebook, we have built a graph analytics platform based on Apache Giraph, which we described in a previous blog post [7] and in our VLDB '15 paper [1]. Since its inception, Giraph has continued to evolve, allowing us to handle Facebook-scale production workloads [1] while also becoming easier for users to program [6]. In the meantime, a number of other graph processing engines have emerged. For instance, the Spark framework, which has gained adoption as a platform for general data processing, offers a graph-oriented programming model and execution engine as well — GraphX [5].
As our goal is to serve internal workloads in the best possible way, we decided to do a quantitative and qualitative comparison between Giraph and GraphX, since at Facebook we already have a Spark cluster supporting production workloads [11]. Specifically, we are interested in measuring the relative performance and ability of the two systems to handle large graphs. While performance is our main focus, we are also interested in usability and how easy it is to develop applications in the two systems.
The key takeaways from our study are the following:

Performance & scalability

Giraph can currently process at least 50x larger graphs than GraphX.

Even on smaller graphs, Giraph performs better. In addition, GraphX exhibits large performance variance.

Giraph is more memory-efficient. The total memory required to run jobs on a fixed graph size is a few times lower for Giraph than for GraphX.

As a consequence, Giraph requires fewer machine hours to process the same graph, making it more efficient overall.

Usability & application development

GraphX allows graphs to be read from Hive using an SQL-like query, allowing arbitrary column transformations. Giraph requires extra programming effort for such preprocessing tasks.

Interacting with data in Spark (e.g., checking the output of a job) is convenient, though mostly when experimenting with small graphs.

Using Scala from the shell environment can be convenient for testing simple GraphX applications, though an IDE is still indispensable when applications become more complex.

Overall, we found that Giraph was better able to handle production-scale workloads, while GraphX offered a number of features that enabled easier development. In the rest of this post, we describe in more detail our methodology and findings.
Methodology

We wanted to evaluate how the two systems perform with different applications, data sets, and cluster setups, and to perform the comparison as fairly as possible.
Different algorithms may have different computation and communication patterns that exercise various parts of a system in different ways. For instance, message-heavy algorithms typically require more memory and are therefore more sensitive to how efficient memory management is. Here, we use three different algorithms: PageRank, Connected Components, and Triangle Counting. PageRank exhibits a typical computation and communication pattern: The amount of computation and communication is constant across iterations, and both are O(|E|), where E is the set of graph edges. Connected Components is similar, though the amount of computation and communication decreases across iterations. These two algorithms are also used in the GraphX paper [5], so they serve as a good reference point. Triangle Counting, on the other hand, is more message-heavy, as the amount of communication is sum(d_i^2), where d_i is the degree of vertex v_i. All these algorithms exercise the basic graph computation concepts: storing the graph and sending messages (shuffling data). They all have straightforward implementations in both Giraph and GraphX, allowing us to factor out artifacts of differences in the implementations.
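These cost patterns can be made concrete in a few lines of Scala. The object, function names, and toy edge list below are illustrative assumptions, not code from either system; the sketch only spells out the O(|E|) and sum(d_i^2) message-volume estimates described above.

```scala
// Sketch of per-iteration message-volume estimates for the algorithms
// discussed above, over a plain undirected edge list. Illustrative only.
object MessageCost {
  // PageRank / Connected Components: one message per edge endpoint per
  // iteration, i.e., O(|E|).
  def pagerankMessages(edges: Seq[(Long, Long)]): Long = 2L * edges.size

  // Triangle Counting: each vertex forwards its neighbor list to all of
  // its neighbors, so the volume grows as the sum of d_i^2 over vertices.
  def triangleMessages(edges: Seq[(Long, Long)]): Long = {
    val degree: Map[Long, Int] =
      edges.flatMap { case (s, d) => Seq(s, d) }
        .groupBy(identity)
        .map { case (v, occurrences) => v -> occurrences.size }
    degree.values.map(d => d.toLong * d).sum
  }
}
```

Even on this toy scale, the quadratic dependence on vertex degree shows why Triangle Counting puts so much more pressure on memory and shuffling than the other two algorithms.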
We also used the same data types to represent the graph metadata in both systems, as this affects memory usage. For instance, in the unweighted PageRank implementation in Giraph, we used longs for vertex IDs, nulls for edge values, and doubles for messages and the vertex value. Similarly, in GraphX we created a graph with long vertex IDs and no values, while the PageRank implementation uses doubles to store the PageRank of each vertex.
Additionally, we wanted to measure how different types of graphs may affect the relative performance of the two systems. For instance, certain types of graphs (such as the Twitter graph) have supernodes, vertices with extremely high degree, while the Facebook friendship graph does not exhibit this property. The presence of such graph artifacts may, for instance, cause more memory pressure on certain machines or skew the compute load. Depending on the design of each system, these artifacts may affect their performance in different ways.
The cluster and system configuration may also affect relative behavior. For instance, for memory-bound applications, we have found that even if we allocate the same total amount of memory, the number of workers in a cluster may affect performance. This is noteworthy, as different cluster setups may implement different resource allocation policies. On the system side, configuration parameters such as the number of compute threads vs. communication threads, number of graph partitions, or even the choice of the Java Garbage Collection mechanism may also affect performance. Given these variables, we fine-tuned each system to ensure maximum performance. We will be describing the details in each of the experiments below.
Experiment setup

We used Spark-1.6.1 and Giraph-1.2.0 for all our experiments, and Java 8u60. Spark runs jobs on a cluster by starting “executors” on cluster machines. Executors are worker processes, each running in a separate JVM. Within an executor, GraphX parallelizes processing by running multiple tasks, each within its own thread. Giraph has a similar architecture with slightly different nomenclature. For consistency, we use the term “workers” to refer to the worker processes (i.e., Spark executors) and “threads” to refer to the processing threads within each worker (i.e., Spark tasks).
The basic parameters in both systems are the number of workers, the number of processing threads, and the total memory allocated. Even though both systems can run multiple workers on a single physical machine, we found that it is more efficient when we start only one worker process per machine and allocate all the available memory to it. In the following experiments, the number of workers is also the number of physical machines used for each job. We allocated 80 GB of RAM to each worker.
Both systems divide graphs into partitions and allow threads to process partitions in parallel. In Giraph, the number of threads of each type is calculated automatically based on the number of cores allocated to the workers, and the number of partitions is then set in a way that allows for good utilization of these threads. In GraphX, after trying different numbers of partitions, we found that 8 partitions per worker is optimal, even though the machines we used have 20 cores. Both systems allow different graph partitioning methods. In GraphX, we used the EdgePartition2D strategy, as it was 2.5x faster than the default strategy. Giraph does not support such an optimization, so we simply used hash partitioning on the vertex IDs.
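To illustrate the idea behind 2D edge partitioning, here is a simplified, hypothetical sketch in the spirit of (but not identical to) GraphX's EdgePartition2D: edges are placed on a sqrt(P) x sqrt(P) grid keyed by source and destination IDs.

```scala
// Simplified 2D edge-partitioning sketch. The real EdgePartition2D also
// mixes the IDs with a large prime to spread out skewed ID ranges; we
// omit that here for clarity.
object TwoDPartitioner {
  def partition(src: Long, dst: Long, numParts: Int): Int = {
    val side = math.ceil(math.sqrt(numParts.toDouble)).toInt
    val col = (math.abs(src) % side).toInt // grid column from source ID
    val row = (math.abs(dst) % side).toInt // grid row from destination ID
    (col * side + row) % numParts
  }
}
```

Because all edges of a given vertex can only land in one grid row and one grid column, each vertex is replicated to at most about 2*sqrt(P) partitions, which is what makes this family of strategies attractive for graphs with supernodes.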
The choice of GC mechanism also mattered. Both systems performed better with Parallel GC as the JVM garbage collection mechanism. On top of this, GraphX exhibited larger performance variance when using CMS or G1 as the GC mechanism. In all experiments below, we used Parallel GC.
We used the following configuration for Spark:
[code]spark.dynamicAllocation.initialExecutors=1/2/4/8/16
spark.dynamicAllocation.minExecutors=1/2/4/8/16
spark.dynamicAllocation.maxExecutors=1/2/4/8/16
spark.executor.memory=80g
spark.executor.cores=20
spark.shuffle.manager=hash
spark.shuffle.service.enabled=false
spark.executor.extraJavaOptions="-XX:+UseParallelGC -XX:+UseParallelOldGC"[/code] We disabled the Spark external shuffle service to avoid writing intermediate shuffle data to disk, which improved performance. We also configured our underlying resource manager to run a single executor per machine. If you are using YARN, you can achieve this by setting the following parameter:
[code]yarn.scheduler.maximum-allocation-mb=YYY[/code] For each GraphX application, we prepared the graph by setting the number of partitions and the partitioning method:
[code]import org.apache.spark.graphx._
val relationships = hql_query_to_read_data.
map(row => Edge(row.getLong(0), row.getLong(1), null)).
repartition(numExecutors * 8)
val graph = Graph.fromEdges(relationships, 0).
partitionBy(PartitionStrategy.EdgePartition2D)[/code] For GraphX, the times we are reporting include neither data loading nor this graph preparation. Similarly, for Giraph, we measure time from the moment the graph is loaded.
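For reference, this is roughly how one can separate the two phases when timing such runs. The helper and the `runAlgorithm` name below are illustrative placeholders, not part of either system's API; the point is simply that the graph is materialized before the clock starts.

```scala
// Minimal wall-clock timing helper: run `body`, return its result
// together with the elapsed time in seconds.
object Timing {
  def timeSeconds[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }
}
```

In a Spark shell, one would first force the graph to materialize (e.g., with an action such as `graph.edges.count()`), and only then wrap the algorithm call: `val (_, secs) = Timing.timeSeconds { runAlgorithm(graph) }`.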
On the Giraph side, all configuration parameters are encapsulated in a special Java class that can be selectively enabled in the command line.
Performance & scalability

We wanted to evaluate the relative performance and scalability of the two systems. We started by trying to reproduce the results from the GraphX paper [5] using the same two publicly available graph data sets: the Twitter graph, which has 1.5 billion edges, and the UK web graph, which has 3.7 billion edges.
Our initial runs on the Twitter graph with the default GraphX configuration were more than 10x slower than what was reported in the paper. After careful tuning, our experiments exhibited performance closer to that observed in the paper.
In our first experiment, we ran 20 iterations of the PageRank and Connected Components algorithms on both graphs, varying the number of workers. The following figures show the running time for both systems. Note that in all the experiments, we exclude the time to load the input data into memory, as the input-reading components are internal to each system's implementation; here, we report only the time for the actual processing. In addition, performance on GraphX varied significantly across runs, so the running times reported are the minimum we observed, not the average.

Figure 1. Running time of PageRank on the Twitter graph (left) and the UK Web graph (right) as the number of machines varies.

Figure 2. Running time of Connected Components on the Twitter graph (left) and the UK Web graph (right) as the number of machines varies.

Giraph runs PageRank on the Twitter graph 4.5 times faster than GraphX with 16 workers, and four times faster with eight workers. In both systems, the running time changes as expected with respect to the number of workers (i.e., close to 1/n). But when the machines start getting low on memory, performance drops dramatically. This happens at a different number of machines for Giraph and GraphX. For instance, the PageRank running time on GraphX rises from six minutes to 41 minutes going from eight workers to four workers, while Giraph exhibits only a small increase — from 1.5 minutes to three minutes. The Giraph running time increases more sharply when we go from two machines to one, but still not as dramatically. In the case of the UK web graph, the performance difference is larger, ranging from Giraph running 5.8 times faster than GraphX with eight machines to five times faster with 16 machines. GraphX was not able to process the UK graph with four machines. The results are similar for the Connected Components algorithm.
The difference in performance is even wider if we consider efficiency in terms of the total machine time required to process a graph. GraphX requires 34 machine minutes (eight machines x 255 seconds) to find the Connected Components of the Twitter graph, while with Giraph we can process the graph in six machine minutes (two machines x 186 seconds), requiring 5.6x less machine time. On the UK web graph, Giraph requires 9x less machine time.
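The machine-time arithmetic above is simple enough to write down directly. The helper below is an illustrative assumption (not code we ran); it just restates machines x wall-clock seconds as machine minutes, using the Connected Components numbers reported in this section.

```scala
// Total machine time consumed by a job, in machine minutes:
// number of machines used x wall-clock running time.
object MachineTime {
  def machineMinutes(machines: Int, seconds: Double): Double =
    machines * seconds / 60.0
}
```

For example, `machineMinutes(8, 255)` gives 34 machine minutes for the GraphX run, while `machineMinutes(2, 186)` gives roughly six for the Giraph run.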
We also ran the more message-heavy Triangle Counting algorithm in the two systems. With Giraph, we needed 32 machines to run Triangle Counting on the UK graph, which took about 40 minutes, while GraphX could not process the UK graph with the same number of machines. Running Triangle Counting on the Twitter graph took multiple hours on Giraph and failed on GraphX, so we omit any results.
After tuning the two systems and getting GraphX to perform close to what was reported in the GraphX paper, we wanted to evaluate performance on other graphs as well. In particular, we wanted to evaluate the two systems on social graphs that do not have supernodes. For this experiment, we used Darwini [8, 9], an open source synthetic graph generator tool we developed at Facebook. We used Darwini to produce synthetic graphs [10] with properties (e.g. degree and clustering coefficient distributions) similar to those of social networks that cap the number of friends a person can have. All the synthetic graphs in these experiments were created using this tool.
We started by generating a graph with 2 billion edges, the largest graph on which we could run all three algorithms with GraphX using 16 machines. Figure 3 shows the results.