Introduction to Property Graph, Apache TinkerPop and Gremlin

A brief introduction to property graphs, Apache TinkerPop, Gremlin, and graph databases and analytic systems for contextual applications.

Before diving into Gremlin queries, I want to cover the basics of graphs, why we use them, and what Apache TinkerPop and Gremlin actually are.

Let’s start with the basics.

A graph is a network of vertices (nodes) and edges (links). Graphs are intuitive because they map real-world concepts like people, places and things, and the relationships between them, directly onto a data structure. Each vertex in a graph is a thing and each edge is a relationship between two things.

Rather than going on and on about graphs, I will jump straight into an example. I can model myself and this post (the two things) as a graph. Each thing, or vertex, has one or more properties that describe it. I have a name and a set of skills, while the post has a title and a topic. The relationship between things can also have properties, such as the date the link was created.

In 1000 words, it looks like this:

What I described above is known as a property graph. A property graph, or more specifically a labeled property graph, is a graph with vertices, directed edges, labels that identify the type of a vertex or an edge, and properties that describe them. In the picture above, the left vertex has a “person” label, the right vertex has a “post” label and the edge has a “wrote” label. Each of them also has an ID that uniquely identifies it in the graph, and properties made of key:value pairs. In reality, the graph would be more complex, with many more edges and vertices such as comments and user accounts.
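To make the model concrete, here is a minimal Python sketch of the two-vertex graph described above. The class names (Vertex, Edge), the field names and every property value are illustrative placeholders, not part of TinkerPop and not taken from the picture.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    label: str                      # the type of thing, e.g. "person"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    label: str                      # the type of relationship, e.g. "wrote"
    out_v: Vertex                   # vertex the edge leaves
    in_v: Vertex                    # vertex the edge points to
    properties: dict = field(default_factory=dict)


# Placeholder values only: the real name, title and date are assumptions.
person = Vertex(1, "person", {"name": "author", "skills": ["data engineering"]})
post = Vertex(2, "post", {"title": "intro to graphs", "topic": "graphs"})
wrote = Edge(3, "wrote", out_v=person, in_v=post, properties={"date": "2020-01-01"})

print(f"{wrote.out_v.label} -[{wrote.label}]-> {wrote.in_v.label}")
```

Running it prints person -[wrote]-> post, which is the picture in one line: two labeled vertices joined by one labeled, directed edge.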

Now that we have a basic understanding of graphs, we can look at why and when to use them. As I wrote above, graphs model things and the relationships between them very well, which makes them a suitable structure for highly connected data. Today we have user data, behavioural data and data from sensors and IoT devices, and there is growing demand for contextual applications. A graph structure lets us model and store this data in a way that makes it efficient to query.

If I carry on with the example above, once I have found myself on the graph, I also have references to the posts I have written. In a relational database, authors and posts would live in separate tables and I would have to join them, which involves lookups in both tables. This is not a problem for current databases because they apply efficient indexing and query optimisation techniques. However, the complexity of data is increasing as we capture an ever-increasing number of user interactions across multiple systems, and storing data in graph structures will have an advantage for certain applications.

Things in a graph structure inherently store references to other things, which means I can fetch all interactions of a user without performing index lookups or joins. Essentially, stepping from a vertex to a neighbour becomes a constant-time operation, so the cost of the query depends on the number of edges it follows rather than on the total size of the data.
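The difference between the two access patterns can be sketched in plain Python. This is only an illustration under simplifying assumptions: the rows, the Vertex class and both helper functions are hypothetical, and real databases use indexes rather than the list scans shown here.

```python
# Relational style: two tables linked by a foreign key (hypothetical rows).
people = [{"id": 1, "name": "author"}, {"id": 2, "name": "reader"}]
posts = [
    {"id": 10, "author_id": 1, "title": "post A"},
    {"id": 11, "author_id": 2, "title": "post B"},
    {"id": 12, "author_id": 1, "title": "post C"},
]


def titles_by_join(name):
    # Look the person up, then search the posts table for matching rows.
    ids = {p["id"] for p in people if p["name"] == name}
    return [post["title"] for post in posts if post["author_id"] in ids]


# Graph style: each vertex keeps direct references to its neighbours.
class Vertex:
    def __init__(self, **properties):
        self.properties = properties
        self.out = {}  # edge label -> list of neighbouring vertices

    def add_edge(self, label, other):
        self.out.setdefault(label, []).append(other)


author = Vertex(name="author")
for row in posts:
    if row["author_id"] == 1:
        author.add_edge("wrote", Vertex(title=row["title"]))


def titles_by_adjacency(vertex):
    # No table lookup: just follow the references stored on the vertex.
    return [v.properties["title"] for v in vertex.out.get("wrote", [])]


print(titles_by_join("author"))
print(titles_by_adjacency(author))
```

Both calls print ['post A', 'post C']. The join version has to consult the posts table, while the adjacency version only touches the edges that hang off the starting vertex.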

Beyond these inherent benefits, graphs are also an excellent fit for fraud detection, social networking and recommendation systems. For such applications, graph databases have shown a clear advantage. We will get into graph databases in an upcoming post.

Before moving on to Apache TinkerPop and Gremlin, it’s important to note that the property graph is not the only graph model. There is also RDF, the Resource Description Framework. I will not delve into the differences between these models in this post, but I would say that property graphs are more intuitive as they closely match the way we think about and understand the world.

Apache TinkerPop is a graph computing framework that builds on the property graph model. It is an abstraction layer over the model, with a set of APIs that make it agnostic of how graphs are stored or processed. A “TinkerPop-enabled” graph database or analytic system implements these APIs, allowing us to run the same queries across different providers, much like SQL. Apache TinkerPop has one set of APIs for users writing graph queries, or traversals, and another for providers of graph databases and analytic systems.
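The idea of a provider-facing contract and a user-facing query can be sketched as follows. This is a toy illustration only: GraphProvider, the two backing classes and ids_with_label are hypothetical names, and the real TinkerPop interfaces are far richer than this.

```python
from abc import ABC, abstractmethod


class GraphProvider(ABC):
    """Hypothetical provider-side contract (not TinkerPop's real interface)."""

    @abstractmethod
    def vertices(self):
        """Yield (vertex_id, label) pairs."""


class DictBackedGraph(GraphProvider):
    def __init__(self, labels_by_id):
        self._labels_by_id = labels_by_id  # dict of id -> label

    def vertices(self):
        yield from self._labels_by_id.items()


class RowBackedGraph(GraphProvider):
    def __init__(self, rows):
        self._rows = rows  # list of (id, label) tuples

    def vertices(self):
        yield from self._rows


def ids_with_label(graph, label):
    # User-side query: written once, against the contract only.
    return sorted(vid for vid, vlabel in graph.vertices() if vlabel == label)


a = DictBackedGraph({1: "person", 2: "post"})
b = RowBackedGraph([(2, "post"), (1, "person")])
print(ids_with_label(a, "person"), ids_with_label(b, "person"))
```

The same query function returns [1] for both providers even though they store their vertices differently, which is the property the paragraph above describes.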

Gremlin is the graph traversal language of Apache TinkerPop. It is to graph systems what SQL is to relational databases, but unlike SQL, it is not fundamentally different from a programming language. Gremlin can be viewed as both a query language and a programming language.
It can be written in any host language, as long as that language supports function composition and nesting. This is made possible by the Gremlin Traversal Machine (GTM), the virtual machine that processes traversals written in Gremlin.
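To show what “function composition and nesting” means in practice, here is a small Python sketch in which each step is a function over a stream of items. The helper names (compose, has_label, values, is_, where) and the sample data are assumptions made for illustration; this is not how Gremlin is implemented.

```python
from functools import reduce


def compose(*steps):
    # Chain steps left to right: the output stream of one feeds the next.
    return lambda items: reduce(lambda stream, step: step(stream), steps, items)


def has_label(label):
    return lambda vs: (v for v in vs if v["label"] == label)


def values(key):
    # Items that lack the key are skipped rather than raising an error.
    return lambda vs: (v[key] for v in vs if key in v)


def is_(predicate):
    return lambda xs: (x for x in xs if predicate(x))


def where(sub_traversal):
    # Nesting: keep an item only if the inner traversal yields something for it.
    return lambda vs: (v for v in vs if any(True for _ in sub_traversal([v])))


# Hypothetical sample data.
vertices = [
    {"label": "person", "name": "alice", "age": 34},
    {"label": "person", "name": "bob", "age": 25},
    {"label": "post", "name": "intro"},
]

older_than_30 = compose(
    has_label("person"),
    where(compose(values("age"), is_(lambda age: age > 30))),
    values("name"),
)
print(list(older_than_30(vertices)))
```

This prints ['alice']. The shape mirrors a Gremlin traversal such as g.V().hasLabel('person').where(values('age').is(gt(30))).values('name'): steps chained one after another, with a smaller traversal nested inside where().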

I recommend the official Apache TinkerPop website as a good starting point to get familiar with the framework and the Gremlin traversal language. Now, without further ado, let’s look at how to get started with writing Gremlin.

The easiest way to get started is with the Gremlin Console. It is a REPL-like environment that makes it easy to run Gremlin queries and perform ad-hoc analyses, and it comes with an in-memory graph database called TinkerGraph. You can download the Gremlin Console from the Apache TinkerPop website and run bin/gremlin.sh from the extracted directory to start it. However, I find it easier to manage versions and installations with Docker.

You can download and install Docker Desktop from the Docker website. Once Docker is installed, run the following to start a Gremlin Console container:

docker run -it tinkerpop/gremlin-console

Once started, your terminal should look like this:

Here, you can type any valid Gremlin query or, for that matter, any valid Groovy code, as the console is based on the Groovy Shell.

The Gremlin Console comes with a built-in example graph that we can create using graph = TinkerFactory.createModern()

This is the graph it creates:

Source: Apache TinkerPop

The integers on the vertices and edges are the IDs and the words are the labels. The additional white boxes contain the properties of each element. Person vertices have a name and an age, software vertices have a name and a language, and edges have a weight indicating the strength of a “knows” relationship or the size of a contribution on a “created” edge.
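For readers who prefer data to diagrams, the same toy graph can be written out as plain Python dictionaries. This is a hand transcription of the figure, so treat the values as an assumption and check them against the console if they matter; the dictionary layout itself is only one possible encoding.

```python
# Vertices of the Modern toy graph: id -> (label, properties).
vertices = {
    1: ("person", {"name": "marko", "age": 29}),
    2: ("person", {"name": "vadas", "age": 27}),
    3: ("software", {"name": "lop", "lang": "java"}),
    4: ("person", {"name": "josh", "age": 32}),
    5: ("software", {"name": "ripple", "lang": "java"}),
    6: ("person", {"name": "peter", "age": 35}),
}

# Edges: id -> (label, out-vertex id, in-vertex id, properties).
edges = {
    7: ("knows", 1, 2, {"weight": 0.5}),
    8: ("knows", 1, 4, {"weight": 1.0}),
    9: ("created", 1, 3, {"weight": 0.4}),
    10: ("created", 4, 5, {"weight": 1.0}),
    11: ("created", 4, 3, {"weight": 0.4}),
    12: ("created", 6, 3, {"weight": 0.2}),
}

# Filter the vertices by label, much like a hasLabel step would.
person_ids = [vid for vid, (label, _) in vertices.items() if label == "person"]
print(person_ids)
```

Filtering the vertices by the “person” label prints [1, 2, 4, 6].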

To run graph traversals, we need a graph traversal source. A GraphTraversalSource is yet another layer on top of the Graph that adds context such as traversal strategies and engines. Different graph systems may provide different implementations of these, but because TinkerPop provides a common API, we can run the same traversal on any “TinkerPop-enabled” system.
To create a graph traversal source for the embedded graph, run g = traversal().withEmbedded(graph)

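Once the traversal source is created, we can start finding answers to questions like “how many people are there on the graph?”

The query g.V().hasLabel('person') will return the person vertices themselves:

==>v[1]
==>v[2]
==>v[4]
==>v[6]

1, 2, 4 and 6 are indeed the IDs of the “person” vertices on the graph. To get the number directly, append a count() step: g.V().hasLabel('person').count() returns 4.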

We can also ask a more complex question like “Which of Marko’s friends have contributed to the same software as him?”

g.V(1).as('marko').
  out('knows').as('friend').
  out('created').
  in('created').where(eq('marko')).
  select('friend').
  values('name')

The result from the above traversal is:

==>josh

Indeed, if we look at the graph, Marko knows Josh and they both contributed to the same software, namely lop.
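To see what each step of the traversal is doing, here is the same logic spelled out as nested loops in plain Python. The edge list is a hand transcription of the Modern graph, and the helper names (out, in_, friends_who_co_created) are made up for this sketch; it mimics the traversal's result, not how Gremlin executes it.

```python
names = {1: "marko", 2: "vadas", 3: "lop", 4: "josh", 5: "ripple", 6: "peter"}
edges = [  # (label, out-vertex id, in-vertex id)
    ("knows", 1, 2), ("knows", 1, 4),
    ("created", 1, 3), ("created", 4, 5), ("created", 4, 3), ("created", 6, 3),
]


def out(vid, label):
    # Follow outgoing edges with the given label.
    return [dst for lbl, src, dst in edges if lbl == label and src == vid]


def in_(vid, label):
    # Follow incoming edges with the given label, back to their source.
    return [src for lbl, src, dst in edges if lbl == label and dst == vid]


def friends_who_co_created(start):
    results = []
    for friend in out(start, "knows"):                # .out('knows').as('friend')
        for software in out(friend, "created"):       # .out('created')
            for creator in in_(software, "created"):  # .in('created')
                if creator == start:                  # .where(eq('marko'))
                    results.append(names[friend])     # .select('friend').values('name')
    return results


print(friends_who_co_created(1))
```

This prints ['josh']: Vadas created nothing, Josh created ripple and lop, and only lop leads back to Marko.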

Hopefully, you now know the basics of graphs, Apache TinkerPop and Gremlin. In the next post, we will look at connecting to a remote graph database and running Gremlin traversals using Python.

Data Engineer @DataReply