# Betting the Company
## (Literally) on a
# Graph Database
### Tips, Tricks, and Lessons Learned
[Aseem Kishore](http://aseemk.com/)
Jan–Nov 2013

Hi guys. My name is [Aseem Kishore](http://aseemk.com/). I'm a developer at a startup here in NYC called [FiftyThree](http://www.fiftythree.com/). We make an iPad app called [Paper](http://www.fiftythree.com/paper). Before I joined FiftyThree, I built a startup with a friend called [The Thingdom](http://www.thethingdom.com/), which was a social network around products. Early on in The Thingdom's life, we decided to build on [Neo4j](http://www.neo4j.org/), a graph database. And since every choice you make when you're a startup (especially early on) is literally "betting the company", choosing to build on a graph database was definitely a significant one. Over this talk, I'll share some of the lessons we learned along the way, but I'll also cover the basics around graph databases generally and Neo4j specifically.

[![](/images/neo4j-lessons-learned/fiftythree-hp.png)](http://www.fiftythree.com/)

[![](/images/neo4j-lessons-learned/thingdom-hp-gasi.png)](http://www.thethingdom.com/)

[![](/images/neo4j-lessons-learned/zoomit-hp.png)](http://zoom.it/)

![](/images/neo4j-lessons-learned/calvin-hobbes-ignorance.jpg)

[![](/images/neo4j-lessons-learned/neo4j-viz.png)](http://www.neo4j.org/)

![](/images/neo4j-lessons-learned/thingdom-graph-basic.png)

![](/images/neo4j-lessons-learned/thingdom-recommendations.png)

![](/images/neo4j-lessons-learned/meme-camera-phone.jpg)

[![](/images/neo4j-lessons-learned/gasi.jpg)](http://gasi.ch/) Daniel Gasienica
[@gasi](https://twitter.com/gasi)

# So…

# Just what is a graph database?

![](/images/neo4j-lessons-learned/GraphDatabase_PropertyGraph.png)

And indeed, that's how Neo4j stores its data. It happens to use doubly-linked lists for everything, but the idea is the same: for each node, there's a list of its incoming and outgoing relationships. Neo4j also uses fixed-size records for both nodes and relationships, with each record offset by its ID, so that if you know the ID of a node or relationship, a direct lookup is O(1). To achieve the fixed-size records, the records contain only "head" pointers to their linked lists.

This is really useful to know, because it shows you a few properties of Neo4j (and maybe graph databases in general):

- It's great at localized searches. E.g. to get the people you follow, it just needs to follow your node's linked list of relationships -- and the performance of this won't change if there are 100 people globally or 1M.
- It's not great at aggregation. E.g. the nodes or relationships aren't stored in any sorted order, so deriving the 20 most popular users requires a full scan.
- It suffers from the "supernode problem". At least currently, a node's neighboring relationships are stored as a flat list, so if you have a million followers, fetching even one person you follow is slow. This could be solved by storing neighbors as a hash table or a B-tree, but there are other trade-offs then, too. We'll come back to this.

(Slides from [Tobias Lindaaker](https://github.com/thobe), Neo4j engineer; retrieved from [slideshare.net](http://www.slideshare.net/thobe/an-overview-of-neo4j-internals))
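To make the storage idea concrete, here's a rough sketch in Python. It is a toy in-memory model, not Neo4j's actual file format: the names `ToyStore`, `NodeRecord`, and `RelRecord` are made up for illustration, and the chains are singly linked rather than doubly linked to keep it short. It shows records addressed by ID (a list index here stands in for a file offset) and nodes that hold only a "head" pointer into a chain of relationships.

```python
from dataclasses import dataclass
from typing import Iterator, List

NO_REL = -1  # sentinel meaning "end of the relationship chain"


@dataclass
class NodeRecord:
    first_rel: int = NO_REL  # "head" pointer into this node's relationship chain


@dataclass
class RelRecord:
    start: int
    end: int
    rel_type: str
    start_next: int = NO_REL  # next relationship in the start node's chain
    end_next: int = NO_REL    # next relationship in the end node's chain


class ToyStore:
    """Hypothetical toy store: a record's ID is its index, so lookup by ID is O(1)."""

    def __init__(self) -> None:
        self.nodes: List[NodeRecord] = []
        self.rels: List[RelRecord] = []

    def add_node(self) -> int:
        self.nodes.append(NodeRecord())
        return len(self.nodes) - 1

    def add_rel(self, start: int, end: int, rel_type: str) -> int:
        rel_id = len(self.rels)
        # Prepend the new relationship to both endpoints' chains.
        self.rels.append(RelRecord(
            start, end, rel_type,
            start_next=self.nodes[start].first_rel,
            end_next=self.nodes[end].first_rel,
        ))
        self.nodes[start].first_rel = rel_id
        self.nodes[end].first_rel = rel_id
        return rel_id

    def rels_of(self, node_id: int) -> Iterator[RelRecord]:
        """Walk one node's chain; the cost depends only on that node's degree."""
        rel_id = self.nodes[node_id].first_rel
        while rel_id != NO_REL:
            rel = self.rels[rel_id]
            yield rel
            rel_id = rel.start_next if rel.start == node_id else rel.end_next
```

Note that `rels_of` never touches records belonging to unrelated nodes, which is the "localized search" property. It also has to walk the whole chain even if you only care about one relationship, which is the supernode problem in miniature.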

![](/images/neo4j-lessons-learned/relational-to-graph.png)

By definition, a graph database is any storage system that provides index-free adjacency.

This means that every element contains direct pointers to its adjacent elements, so no index lookups are necessary.

![](/images/neo4j-lessons-learned/meme-joins.jpg)

# Okay...

# Let's talk about what we learned

# Our usage
## **Node.js** +
## **REST API** +
## **Cypher**

[![](/images/neo4j-lessons-learned/thingdom-node-neo4j.png)](https://github.com/thingdom/node-neo4j)

![](/images/neo4j-lessons-learned/thingdom-graph-basic.png)

# What we learned

Unique, expressive relationship types

Sounds simple, but that's an important lesson that was reinforced to us over and over. It's tempting to stay "pure" by using "simple" names like `likes` and `follows` and `author` everywhere, but don't overload them. It's always easy to query for both types of relationships when the names are different (just an `OR` in Cypher), but it's not possible to (efficiently) query for just one type when the names are the same. E.g. at some point, we were considering letting people "follow" categories/brands too, but we resisted the temptation to reuse the `follows` relationship name. If we had, we wouldn't have been able to efficiently traverse just the *users* someone followed, or vice versa. And again, the way Neo4j stores data on disk, the relationship type is the only way it knows during a traversal whether to visit a node or not. Unique and expressive relationship types can save you a lot in performance.
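Here's a small sketch of why that matters, again as a toy Python model rather than anything Neo4j-specific. The relationship names `follows_user` and `follows_category` and the sample data are hypothetical. The first function models an overloaded `follows` type, where each target node has to be loaded just to find out what it is. The second models distinct types, where the relationship type alone decides.

```python
from typing import Dict, List, Tuple

Rel = Tuple[str, str]  # (relationship type, target node id)


def followed_users_overloaded(rels: List[Rel], kinds: Dict[str, str]) -> Tuple[List[str], int]:
    """One `follows` type for everything: load every target to check its kind."""
    loaded = 0
    users = []
    for rel_type, target in rels:
        if rel_type == "follows":
            loaded += 1  # a node lookup is needed to learn what the target is
            if kinds[target] == "user":
                users.append(target)
    return users, loaded


def followed_users_distinct(rels: List[Rel]) -> Tuple[List[str], int]:
    """Distinct types: only targets of matching relationships are ever visited."""
    users = [target for rel_type, target in rels if rel_type == "follows_user"]
    return users, len(users)
```

In both versions every relationship is still scanned. What the distinct types save is the node visits.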

![](/images/neo4j-lessons-learned/thingdom-people-stats.png)

![](/images/neo4j-lessons-learned/thingdom-graph-stats.png)

![](/images/neo4j-lessons-learned/webadmin-user-stats.png)

# What we learned

Unique, expressive relationship types
Cache stats where possible

![](/images/neo4j-lessons-learned/thingdom-graph-events.png)

I showed earlier that our basic graph was that users were connected to things through `has` and `wants` relationships. That worked fine for the present -- this moment in time -- but how does that work over time, when those relationships can change? E.g. a user wants the new iPhone before it's out, then when it's out, he/she gets one, then months later, he/she loses it by accident, etc.

We obviously want traversals to be efficient when we're querying e.g. who wants this iPhone, or what things does a user have. But we may also want to "remember" that history, as we did. To achieve both those things, we settled on a philosophy that worked well for us: we decided that relationships reflect the "state of the world" at this moment in time, and *event nodes* could be used to capture history.

So in this case, we came up with "have/want event" nodes, which, like `has` and `wants` relationships, connected users with things, but also stored (as properties on the node) the type of change (e.g. `has` → `wants`). This was better than e.g. keeping relationships around forever and storing `deleted` properties on relationships we should ignore, because traversal performance would have suffered.
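A minimal sketch of that split, assuming a toy in-memory model (the names `ToyGraph`, `EventNode`, and `set_state` are invented for illustration and are not The Thingdom's actual code): the `current` dict plays the role of the live `has`/`wants` relationships, and every change also appends an event node recording the transition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class EventNode:
    user: str
    thing: str
    old_state: Optional[str]  # None: there was no relationship before
    new_state: Optional[str]  # None: the relationship was removed
    seq: int                  # stand-in for a timestamp


@dataclass
class ToyGraph:
    # (user, thing) -> "has" / "wants": the state of the world right now
    current: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # append-only history, one event node per change
    events: List[EventNode] = field(default_factory=list)

    def set_state(self, user: str, thing: str, new_state: Optional[str]) -> EventNode:
        key = (user, thing)
        old_state = self.current.get(key)
        if new_state is None:
            self.current.pop(key, None)    # delete the live relationship outright
        else:
            self.current[key] = new_state  # replace it; no `deleted` flags to skip later
        event = EventNode(user, thing, old_state, new_state, seq=len(self.events))
        self.events.append(event)
        return event

    def things_with_state(self, user: str, state: str) -> List[str]:
        """Query only the current state; history never slows this down."""
        return sorted(t for (u, t), s in self.current.items() if u == user and s == state)
```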

# What we learned

Unique, expressive relationship types
Cache stats where possible
Capture history through event nodes

![](/images/neo4j-lessons-learned/thingdom-graph-suggestions.png)

But we also learned another lesson from that experience: we noticed that these event nodes let us (and our users) do things we couldn't do with the simple `has` and `wants` relationships. Other users could "like" an event node, or comment on it, or be @mentioned by a description on it. That taught us another important lesson: nodes let something truly be "first-class" in a way that relationships couldn't. And that's simply because relationships can't point to other relationships. (These are known as "hyperedges" in graph terminology. Some graph databases may support them, but Neo4j doesn't. I used to wish that it did, but I'm realizing that ultimately, it's not a big deal, since you can easily achieve hyperedges with node and relationship "primitives" like this.)

So at some point, when we implemented a feature to let users suggest things to other users ("Alice thinks you might have or want an iPhone"), we had no choice but to implement those suggestions as nodes (since they connected three nodes, not just two), but we were happy to do so, since that again let us do "first-class" things with them, like letting other users comment on them. Win-win.
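As a rough illustration of the suggestion case, here's a toy sketch with edges as plain triples. The relationship names (`made_suggestion`, `suggested_to`, `suggested_thing`, `commented_on`) are hypothetical, not the ones we actually used. The point is that the three-way connection becomes one node with three ordinary relationships, and anything else can then point at that node.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source node, relationship type, target node)


def add_suggestion(edges: List[Edge], suggestion_id: str,
                   suggester: str, recipient: str, thing: str) -> None:
    """Model a three-way 'hyperedge' as a node plus three binary relationships."""
    edges.append((suggester, "made_suggestion", suggestion_id))
    edges.append((suggestion_id, "suggested_to", recipient))
    edges.append((suggestion_id, "suggested_thing", thing))


def neighbors(edges: List[Edge], node: str) -> List[str]:
    """All nodes directly connected to `node`, in either direction."""
    incoming = {s for s, _, t in edges if t == node}
    outgoing = {t for s, _, t in edges if s == node}
    return sorted(incoming | outgoing)


edges: List[Edge] = []
add_suggestion(edges, "suggestion-1", "alice", "bob", "iphone")
# Because the suggestion is a node, a comment can point straight at it.
edges.append(("carol", "commented_on", "suggestion-1"))
```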

# What we learned

Unique, expressive relationship types
Cache stats where possible
Capture history through event nodes
First-class objects ⇒ nodes, not rels

![](/images/neo4j-lessons-learned/webadmin-thing-categories.png)

![](/images/neo4j-lessons-learned/thingdom-graph-categories.png)

# What we learned

Unique, expressive relationship types
Cache stats where possible
Capture history through event nodes
First-class objects ⇒ nodes, not rels
Connected data ⇒ nodes, not props

Finally, maybe the most important lesson we learned was around our activity feed implementation. This was probably the most important lesson because it's not necessarily obvious if you're new to graphs / graph databases, but it's also one of the most fundamental.

As we saw with Neo4j's file format, the performance of traversing a node's relationships scales linearly with the number of those relationships. Having unique, expressive types on those relationships lets you cut down on the number of *nodes* that are visited, but each of the *relationships* must still be visited, because the relationships are (currently, at least) all stored as one flat list. So when a node starts to have many, many relationships, efficient traversal becomes a problem for that node. This is called the **supernode problem**.

And this is indeed what happens when you implement "event" nodes for users like we did: the number of relationships for a user keeps growing and growing (because the number of event nodes they're connected to keeps growing and growing). So when we implemented our activity feed, we went the simple/naive way of using a Cypher aggregation on users' event nodes: essentially, an `ORDER BY` followed by a `LIMIT`. This meant that Neo4j had to traverse every relationship -- *and* visit every event node -- in order to determine the top (most recent) events. This kind of aggregation is obviously not making good use of a graph database, and we saw how this didn't scale as our data grew.

This picture shows what we *should* have done: make use of the most basic graph data structure -- a linked list. We should have maintained a linked list for each user's events, and appended to (the front of) it on each new event for that user. This would let us follow just the first 10 (or 20, or whatever) relationships -- a constant number, instead of one that grows with the number of events.

The supernode problem still exists for the first relationship, since the user node has many relationships. The Neo4j team has plans to remove this problem by reworking the file format (e.g. storing a node's relationships as a hash table or B-tree instead of a flat list), but until then, you can simply use a global index: for each user, index the "head" relationship for that user to his/her most recent event node. Hat-tip to [René Pickhardt](http://www.rene-pickhardt.de/) for opening my eyes to this approach through his work on [Graphity](http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/), a project which applies the same idea to *aggregate* feeds as well.
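The linked-list approach can be sketched roughly like this. It's a toy in-memory model with invented names (`Feed`, `Event`, `older`), where a plain dict stands in for the global index from each user to the head of his/her list. Reading the newest *k* events follows at most *k* pointers, no matter how many events the user has accumulated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    text: str
    older: Optional["Event"] = None  # pointer to the next-older event in the list


class Feed:
    def __init__(self) -> None:
        # Stand-in for the global index: user -> most recent event node.
        self.head: Dict[str, Event] = {}

    def add_event(self, user: str, text: str) -> Event:
        """Prepend to the user's list: constant work, nothing is re-sorted."""
        event = Event(text, older=self.head.get(user))
        self.head[user] = event
        return event

    def latest(self, user: str, limit: int) -> List[str]:
        """Follow at most `limit` pointers from the head, newest first."""
        out: List[str] = []
        event = self.head.get(user)
        while event is not None and len(out) < limit:
            out.append(event.text)
            event = event.older
        return out
```

Compare this with the `ORDER BY` plus `LIMIT` version, which has to look at every event before it can return even the first one. The trade-off is that writes now have to keep the list and the head pointer consistent.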

# What we learned

Unique, expressive relationship types
Cache stats where possible
Capture history through event nodes
First-class objects ⇒ nodes, not rels
Connected data ⇒ nodes, not props
Maintain linked lists for O(1) queries

# Neo4j 2.0+

Labels & constraints
Relationship type grouping
Transactional Cypher
Automatic sharding?

Going forward, we're pretty excited for a few improvements and developments coming down the pipe in Neo4j 2.0 and beyond.

- Nodes will be able to have optional "labels" (analogous to relationship types), and auto-indexing will be much more tightly integrated into the database with "constraints" attached to labels. This'll be huge.
- They're aware of the supernode problem and have plans to fix it, as I mentioned. Storing relationships as a map or B-tree will go a long way, but I understand there are trade-offs to any representation.
- Cypher continues to improve at an impressive pace, and with 2.0, you're able to group multiple Cypher queries (potentially even across multiple HTTP/REST requests) into a single transaction. This is big for consistency and robustness.
- Graphs aren't trivially partitionable, as I mentioned, but if anyone has the knowledge to figure out how to do it heuristically, and connect the pieces together, it should be a graph database. I believe they're already working on this problem, which is pretty cool.

Overall, it's been great to grow with Neo4j and be a part of the community, and we're still as excited as ever for its future.

**Update:** the Neo4j team in fact just published a blog post on this stuff! [What's coming next in Neo4j](http://blog.neo4j.org/2013/01/2013-whats-coming-next-in-neo4j.html)

[![](/images/neo4j-lessons-learned/fiftythree-hp.png)](http://www.fiftythree.com/)

## And check out...

(MySQL)—[:to]—>(Neo4j)

## A DBA Perspective
## Dave Stern @ 11:30

# Thanks!
### Twitter: [@aseemk](https://twitter.com/aseemk)
### GitHub: [@aseemk](https://github.com/aseemk)
### Email: [aseem.kishore@gmail.com](mailto:aseem.kishore@gmail.com)

Questions?