tech outreach

Modeling Trees: Graphs vs SQL-based MPTT

2018-08-10 4 min read

Back in 2014 I’ve been working for a startup then-called EnergyDeck on behalf of a web studio WebRiders. Even though I started working in tech since I graduated from high school and been working in the field for about 7 years by that time, mostly it was part time on rather smaller projects or internship after or instead of my college hours.

So, here am I, just a year since I finished my master’s degree, working on my first [rather] big[ger] project and enjoying brave new world of real world data modeling problems.

At that particular time, I was working on data processing and analytics over dynamic sets of data series. What is that?

Imagine an organization having few offices
- Each office have few buildings
  - Each building having few floors
    - Each floor having few sections
      - Each section having few rooms
        
        Each rooms having few meters (electricity, temperature, water etc)

And that’s a simplified model. Gosh, what a complicated world we are living in?

Then we started adding tenant-landlord concepts, role-based access, access sharing and so on and so on. The only thing I had in mind at that time was a tree. Or a tree of trees. Or a forest. I did not even know. The only thing I was sure about is that it was a tree, not a graph.

Oh, right, and did I mention SQL? Right! Of course our data were in a SQL database (although in a good one). I can keep going on with details here for hours (curios? invite me for an interview!) but it’s obvious that at some point we started running into performance issues with the simplistic data model we had.

Being a “lead developer” on a project by that time I was looking into potential solutions that were running down to just a couple of items: MPTT or a graph database.

MPTT stands for Modified Preorder Tree Traversal - a technique for storing hierarchical data in a database that aims for efficient retrievals sacrificing performing of insertions. To the best of my knowledge it’s a widely used approach for dealing with hierarchical data in SQL databases and django-mptt (which I actually have used at the time) has few very good references for further reading.

Graph database is a sort of database that allows you to model your data as, well, graphs with nodes (vertexes/vertices) and edges and run semantic queries over them. While they became way more popular recently, they seemed to be a novel bleeding edge technology to me at the time.

Torn between those options I ended up writing a proof of concept using both approaches and running a performance comparison between them. Around the same time I was invited to talk at a small local meetup and prepared a short presentation about my findings.

Unfortunately, the proof of concept itself and all the details about the performance test I did were lost both from my memories and from my backups. I’d not bet on the results I obtained at the time but would take that into account may I face similar problem in the future.

Looking back at the experiment I ran, it seems to be a pretty controversial in some sense. The MPTT implementation was based on Django ORM and therefore suffered from obvious performance issues [as compared to raw SQL]. On the other hand, the graph version was relying on Groovy DSL queries that were transmitted into Titan (now known as JanusGraph) graph engine (running on top of Cassandra storage engine) via Apache TinkerPop facade.

In other words, the results I obtained are less likely represent the best case performance. Instead they showed us “gross” performance we would get with given implementations that we would use may we move forward with any of those options.

As a result, the decision was made to give a try to the solution based in graph database and that’s what I was working on until I left the team to move on with my next job opportunity at National Center for Biotechnology Information.

One of the key takeaways for me was invaluable knowledge and experience I gained experimenting with graph databases. If I were to summarize most valuable references based on that I’d suggest checking out:

Apache TinkerPop - a facade over graph database
JanusGraph - TinkerPop compatible graph engine with multiple storage backend options
Neo4J - another popular “all-in-one” graph database
OrientDB - a “multi-model database” that supports graph semantics