“How cute! So many arrows and lines!” – Anyone who never had to traverse a large graph.
Finally, I will be able to quote myself, yay!
“Graphs, in computer science, are an abstract data type that implements the graph and hypergraph concepts from mathematics. This data structure has a finite – and possibly mutable – set of nodes (or vertices) together with a set of edges (or arcs), each of which is an ordered pair of nodes. An edge can be described mathematically as edge(x, y), which is said to go from x to y. There may be some data associated with an edge, called the edge value. This value can represent a numeric attribute (for example, a cost) or just a symbolic label. There is also the important definition of adjacent(G, x, y): a function that tests, for a given graph data structure G, whether the edge(x, y) exists.”
As mentioned by (Lotz, 2013)
There are many flavours of graphs, each of them capturing some detail of the graph structure. For example, an undirected graph is one in which the edges have no orientation, as opposed to a directed graph, in which they do.
Here’s a quick example of a directed graph:
Example of a directed graph (Wolfram Alpha)
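A directed graph like the one above can be sketched in code as an adjacency structure. This is just a minimal illustration (the node labels and edge costs are made up), showing the adjacent(G, x, y) test from the quoted definition:

```python
# A directed graph stored as an adjacency dict: node -> {neighbour: edge value}.
# The edge values are the optional "edge value" data mentioned above
# (e.g. costs or symbolic labels).
graph = {
    "a": {"b": 5, "c": 2},   # edges a->b (cost 5) and a->c (cost 2)
    "b": {"c": 1},           # edge b->c (cost 1)
    "c": {},                 # c has no outgoing edges
}

def adjacent(g, x, y):
    """Tests whether edge(x, y) exists in graph g."""
    return y in g.get(x, {})

print(adjacent(graph, "a", "b"))  # True  (edge a->b exists)
print(adjacent(graph, "b", "a"))  # False (direction matters in a directed graph)
```

Note that adjacent(graph, "b", "a") is False even though adjacent(graph, "a", "b") is True: that asymmetry is exactly what makes the graph directed.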
Large Scale Graph Processing Problem
“Houston, we have a problem” – The same person, after having to compute a large Graph.
In big data analysis, graphs are usually known to be hard to compute. (Malewicz et al., 2010) – (not my quote this time) – points out that the main factors contributing to these difficulties are:
- Poor locality of memory access.
- Very little work per vertex.
- Changing degree of parallelism.
- Running over many machines makes the problem worse (i.e. how do I make one machine communicate with another without generating an I/O bottleneck?)
Large Scale Graph Pre-Pregel Solutions
Before the introduction of Pregel (Malewicz et al., 2010) – yep, the same guy I quoted above… that’s because he knows it all in this field – the state-of-the-art solutions to this problem were:
- The implementation of a customised framework: this demands lots of engineering effort. It’s like: “Oh, I have a large graph problem that will demand a Big Data approach. Since I have nothing else to do, I will implement my own framework for the problem. It’s ok if it takes 10 years and ends up being really problem-specific.”
- Implementation of a graph analysis algorithm on top of the MapReduce platform: this proves to be inefficient, since too much data must be stored after each MapReduce job and too much communication is required between the stages (which may cause excessive I/O traffic). Aside from that, it is quite hard to implement graph algorithms this way, since the framework offers no option to make the computation node/vertex-oriented.
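To make that awkwardness concrete, here is a toy sketch (plain Python standing in for a real cluster, with made-up helper names) of single-source shortest paths written in MapReduce style. Each iteration would be a full MapReduce job on a real cluster, and notice how the entire adjacency structure must be re-emitted and shuffled every single round just so the next job can see it – that is the stored-data and inter-stage communication overhead described above:

```python
# Toy single-source shortest paths in MapReduce style (illustrative only).
# On a real cluster, each call to one_job() would be a separate MapReduce job
# whose output is written to disk and re-read by the next one.

INF = float("inf")

def map_phase(node, record):
    adj, dist = record
    yield node, ("GRAPH", adj)             # re-emit the graph structure every round
    yield node, ("DIST", dist)             # keep the node's own current distance
    if dist < INF:
        for nbr, w in adj.items():
            yield nbr, ("DIST", dist + w)  # propagate tentative distances

def reduce_phase(node, values):
    adj, dist = {}, INF
    for kind, v in values:
        if kind == "GRAPH":
            adj = v                        # recover the structure...
        else:
            dist = min(dist, v)            # ...and keep the shortest distance seen
    return adj, dist

def one_job(state):
    """Simulates one MapReduce job: map everything, shuffle by key, reduce."""
    shuffled = {}
    for node, record in state.items():
        for key, val in map_phase(node, record):
            shuffled.setdefault(key, []).append(val)
    return {node: reduce_phase(node, vals) for node, vals in shuffled.items()}

# state: node -> (adjacency dict, current distance from source "a")
state = {"a": ({"b": 1, "c": 4}, 0),
         "b": ({"c": 1}, INF),
         "c": ({}, INF)}

for _ in range(len(state)):  # the worst case needs up to |V| - 1 full jobs
    state = one_job(state)

print({n: d for n, (adj, d) in state.items()})  # {'a': 0, 'b': 1, 'c': 2}
```

The algorithm itself is trivial; the pain is structural: there is no per-vertex API, so the graph topology travels through every map/shuffle/reduce stage as payload, round after round. A vertex-centric model (which Pregel introduces) keeps the topology in place and ships only the messages.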
In order to completely avoid the I/O problem, one may use a single-computer graph solution – you said a supercomputer? – This would, however, require non-commodity hardware and would not be scalable. The distributed graph systems that existed before Pregel’s introduction were not fault tolerant, which may cause the loss of a huge amount of work in the case of any failure.
And that’s why my next post will be about Pregel, a large-graph processing framework developed by Google.