Neo4j Presentation at Twitter Headquarters

Several weeks ago, on March 31, 2010, Twitter hosted a talk about Neo4j at the Twitter headquarters in San Francisco. If you are not familiar with Neo4j, it is a database that stores data as a graph, with nodes, relationships (edges), and attributes (of the node). As far as I could glean from the presentation, the graph is directed and can be weighted as well. From my perspective, a few aspects of Neo4j are quite impressive. First, the simplicity of the API lets you get up and running in very little time. Next, the performance is quite remarkable: with 1M nodes, each having 50 connections, it takes only 2 milliseconds to find out whether there is a path between two randomly chosen nodes. It also supports transactions and rollbacks.

One thing to keep in mind about Neo4j is that it requires a JVM to run, but you can still write your code in JRuby or Jython. Another caveat is that Neo4j is not a traditional database server; it is more like an API that builds your datastore as you invoke it. You can build a service interface and determine how you want the service implementation to interact with the APIs and the underlying data storage. A REST server has been developed, but at the talk the recommendation seemed to be to work with the APIs directly.

It seems that if you can or need to represent your data as a graph, then Neo4j is an outstanding candidate. Just keep in mind that the license is AGPLv3, so your product or service built on top of it has to be open source in order to use the open source Neo4j. For the commercial license, however, the first license is offered for free.

This link has the original announcement of the tech talk: http://blog.neo4j.org/2010/03/neo4j-meetup-at-twitter-hq-thu-mar-25.html

The following are my notes from the talk. I arrived a few minutes late so my notes are missing some stuff from the beginning.

Neo4j

Presentation by Emil Eifrem
Whiteboard friendly
- Product managers or customers don't think in ER diagrams but more like a whiteboarded use case.
- Easy to follow a graph as opposed to an overly complicated ER diagram
The Graph DB model: traversal
- Traverser framework for high performance traversing.
- Example: Mr. Anderson's friends
- Code (2): Traversing a node space
  - Traverser friendsTraverser = mrAnderson.traverse(..)
    - The params help determine order (breadth first or depth first), where to start, where to end, filters, etc.
  - Then you can use an iterator to traverse through the graph
- Can traverse the graph and look only for people who know another person and have some other attribute
  - i.e. this friend (node) "loves" that friend (node)
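The traversal idea in the notes above can be sketched in a few lines of plain Python. This is not the Neo4j API: the `Node` class, the `traverse` function, and its `max_depth` and `include` parameters are hypothetical stand-ins, and every name other than Mr. Anderson is made up for illustration. It only shows the shape of a breadth-first traverser that follows one relationship type, visits each node once, and applies a filter.

```python
from collections import deque


class Node:
    """Hypothetical node: a name, some properties, and outgoing typed relationships."""

    def __init__(self, name, **props):
        self.name = name
        self.props = props
        self.rels = []  # list of (relationship type, target node)

    def relate(self, rel_type, other):
        self.rels.append((rel_type, other))


def traverse(start, rel_type, max_depth=None, include=lambda node, depth: True):
    """Breadth-first traversal over rel_type; yields each reachable node at most once."""
    seen = {id(start)}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        # The start node itself is never returned, only nodes reached from it.
        if depth > 0 and include(node, depth):
            yield node
        if max_depth is not None and depth >= max_depth:
            continue  # stop expanding once the depth limit is reached
        for kind, target in node.rels:
            if kind == rel_type and id(target) not in seen:
                seen.add(id(target))
                queue.append((target, depth + 1))


# Illustrative data only.
mr_anderson = Node("Mr. Anderson")
morpheus = Node("Morpheus")
trinity = Node("Trinity")
cypher = Node("Cypher")
mr_anderson.relate("KNOWS", morpheus)
mr_anderson.relate("KNOWS", trinity)
morpheus.relate("KNOWS", trinity)
morpheus.relate("KNOWS", cypher)
trinity.relate("LOVES", mr_anderson)

all_friends = [n.name for n in traverse(mr_anderson, "KNOWS")]
direct_friends = [n.name for n in traverse(mr_anderson, "KNOWS", max_depth=1)]
# Filter: only friends who "love" some other node.
in_love = [
    n.name
    for n in traverse(
        mr_anderson,
        "KNOWS",
        include=lambda node, depth: any(kind == "LOVES" for kind, _ in node.rels),
    )
]
```

The iterator style mirrors the note about walking the result with an iterator: nothing is computed until the caller pulls the next node.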
Bonus code: domain model
- How do you implement your domain model?
- Use the delegator pattern, i.e. every domain entity wraps a Neo4j primitive.
  - Create a bean-like object and hide the get and set operations, which actually set the values on the underlying node
- Never have to handle concurrency
  - Unless you have concurrent/shared state
- Can define the synchronization strategy in the configuration for the database.
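A minimal sketch of the delegator pattern described above, in plain Python rather than against the real Neo4j classes. `PropertyNode` is a hypothetical stand-in for the node primitive, and `Person` is an example domain entity; the point is only that the entity keeps no state of its own and forwards every get and set to the node it wraps.

```python
class PropertyNode:
    """Hypothetical stand-in for a graph node primitive: a bag of properties."""

    def __init__(self):
        self._props = {}

    def get_property(self, key, default=None):
        return self._props.get(key, default)

    def set_property(self, key, value):
        self._props[key] = value


class Person:
    """Bean-like domain entity that delegates all state to the underlying node."""

    def __init__(self, node):
        self._node = node

    @property
    def name(self):
        return self._node.get_property("name")

    @name.setter
    def name(self, value):
        # The value lands on the node, not on the Person object.
        self._node.set_property("name", value)


node = PropertyNode()
person = Person(node)
person.name = "Mr. Anderson"
same_person = Person(node)  # a second wrapper around the same node sees the same data
```

Because the wrapper is stateless, two wrappers around the same node can never disagree, which fits the note that you only have to think about concurrency when you add shared state of your own.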
Domain layer frameworks
- Qi4j (www.qi4j.org)
  - Framework for doing DDD (domain-driven design) in pure Java 5
  - Defines Entities/
- Jo4neo (http://code.google.com/p/joe4neo)
- Also a Grails framework is being developed.
System characteristics
- Disk-based
  - Native graph storage engine with custom binary on-disk format
- Sharding becomes extremely difficult
- Transactional
  - JTA/JTS, XA, 2PC, Tx Recovery
- Scales up
  - Many billions of nodes on a single JVM
- Robust
  - 6+ years in 24/7
- Not always very well documented
- One caveat, you need to benchmark your own stuff
Performance
- 1000 nodes with 50 friends each: pull out 2 random people and see if there is a path
- Relational database: 2000 ms
- Neo4j: 2ms
- Neo4j: 2ms with 1M nodes and 50 friends
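Since the advice was to benchmark your own stuff, here is a rough sketch of the kind of test described above: build a random graph, pick two random people, and time a path-existence check. It is an in-memory Python toy, not Neo4j, so the timings it prints say nothing about the numbers quoted in the talk. The function names, the seed, and the choice of a directed graph are my own assumptions.

```python
import random
import time
from collections import deque


def random_graph(node_count, friends_per_node, seed=0):
    """Directed random graph: each node points at up to friends_per_node others."""
    rng = random.Random(seed)
    graph = {n: set() for n in range(node_count)}
    for n in range(node_count):
        for friend in rng.sample(range(node_count), friends_per_node):
            if friend != n:  # no self-loops, so a node may end up one friend short
                graph[n].add(friend)
    return graph


def path_exists(graph, start, goal):
    """Breadth-first search that stops as soon as the goal is reached."""
    if start == goal:
        return True
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for friend in graph[node]:
            if friend == goal:
                return True
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return False


graph = random_graph(node_count=1000, friends_per_node=50)
rng = random.Random(42)
a, b = rng.sample(range(1000), 2)

started = time.perf_counter()
found = path_exists(graph, a, b)
elapsed_ms = (time.perf_counter() - started) * 1000
print(f"path between {a} and {b}: {found} ({elapsed_ms:.3f} ms)")
```

To compare stores fairly you would run the same check many times against each one, with a warm cache and with a cold one, and look at the distribution rather than a single number.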
Pros & Cons compared to RDBMS
- Good: No OR impedance mismatch (whiteboard friendly)
  - Not an object store
- Good: Can easily evolve schemas
- Good: Can represent semi-structured info
- Good: Can represent graphs/networks (with performance)
- All these NOSQL data models are isomorphic
  - Find out which data model applies well to this problem
- Bad: lacks tool and framework support
- Bad: fewer implementations => potential lock in
  - Hopefully this will change
- Neo4j is still the best for graph databases
- Bad: no support for ad-hoc queries
  - Is becoming less true
  - MongoDB has an edge over CouchDB because it has ad hoc queries
Query languages
- SPARQL - "SQL for linked data" (Web 3.0 Semantic Web)
- Gremlin - "perl for graphs"
  - XPath influenced
- Launched try.neo4j.org
  - Has a console for playing with neo4j with Gremlin
Summary
- Neo4j is an embedded database
  - Tiny teeny lil jar file (500k)
- Component ecosystem
  - etc.
Language bindings
- Python
- JRuby
- Clojure (4 bindings)
- Scala
- Neoclipse tool
Scale out - replication
- Neo4j HA in internal beta testing with customers now
- Master-slave replication, 1st configuration
  - MySQL style... ish
  - Except all instances can write, synchronously between writing slave & master (strongly consistent)
  - Updates are asynchronously propagated to the other slaves (eventual consistency)
  - ... but no 100B
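The replication scheme in the notes above can be modelled with a toy sketch. This is only my reading of the bullets, not how Neo4j HA is implemented: the `Master` and `Slave` classes, the in-memory write log, and the explicit `pull` call standing in for asynchronous propagation are all assumptions made for illustration.

```python
class Master:
    """Toy master: holds the data plus an ordered log of committed writes."""

    def __init__(self):
        self.data = {}
        self.log = []  # list of (key, value) in commit order

    def commit(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Slave:
    """Toy slave: can accept writes, but commits them through the master."""

    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0  # number of master log entries applied locally

    def write(self, key, value):
        # Synchronous part: commit on the master, then catch up before returning,
        # so the writing slave and the master agree immediately.
        self.master.commit(key, value)
        self.pull()

    def pull(self):
        # Stand-in for asynchronous propagation: apply log entries not yet seen.
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)


master = Master()
writer = Slave(master)
reader = Slave(master)

writer.write("name", "Mr. Anderson")
stale_read = reader.data.get("name")  # the other slave has not caught up yet
reader.pull()
fresh_read = reader.data.get("name")  # now it has
```

The gap between `stale_read` and `fresh_read` is the eventual-consistency window from the notes; the writer and the master never show that gap.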
Scale out - partitioning
- To get to 100B, you need automatic sharding of the graph
- "Sharding possible today"
  - ... but you need to do manual work
  - ... just as with MySQL
  - Great option
- Transparent partitioning? Neo4j 2.0
  - 100B? Hard to do.
  - Fundamentals: BASE &
  - Generic clustering algorithm as base cases, but give lots of knobs for developers
Future stuff
- Neo4j HA 1.0
- REST-ful API
- Better language integration, in particular PHP
- Framework integration: Roo, Grails, Django
- Neo4j 1.1
  - Improved JMX/SNMP
  - Scale up >>4B+ (64-bit.. or nay?)
  - Event framework (feedback please)
  - Indexing integration
- Neo4j Spatial (uDig integration, {R,Quad}-tree, Hilbert out-of-the-box)
- Client tools: Webling, Neoclipse, Monitoring console
Other impls
- http://agraph.franz.com
- http://sones.com
- http://kloudshare.com
- Google Pregel http://bit.ly/dP9IP
- Flock - Twitter graph database; open sourced in mid-April
License
- AGPLv3 /commercial license
- if you're open source, we're open source
- The first license is free.

Q&A or Other Info
- Can you use disconnected graphs? Yes.
- The 1.1 release will upgrade the traverser framework significantly
  - Will be able to filter traversal based on attributes
  - A lot of people have asked for being able to make the determination of traversal at each node
    - e.g. Dijkstra
- Can you do an accumulation? Does this node have more than five friends?
  - Can do the counting in the returnable evaluator.
- Focused on high performance traversals
- Works really well with Lucene
- SPARQL support
- What's the biggest graph? For 1.0 product.
  - Billions
  - 4 billion is the upper limit
- Not just an in memory graph
  - Can spool stuff to disk
- If you have a warm cache, then you can do 1-2M traversals per second.
- The traverser framework ensures that you only visit each node once.
- How do you determine the uniqueness of the node?
  - The node id.
- The graph has either directed or non-directed relationships. Can specify when you insert into the graph.
- Want to go in the direction of being parallelizable
- How to deal with changes to the domain model?
  - There is a migration framework, but it is not ideal because it requires code. It works by keeping a version of the node space. There is also a version of the code.
  - With a schema-less database, you can do things on demand, if you want to.
  - The tool support for this needs to be improved.
- Is there any TCP layer to interact with Neo4j?
  - Working on a REST API
  - It's very chatty so best to build your "database" app and then expose the "getFriends" API over REST
- Have an embedded database and have interfaces that wrap it, so the application does not have to worry about what is going on behind the scenes
  - Services architecture, thinks this is a strong architecture
    - Says this is a hard sell to a lot of people but believes this is a good architecture
- When does MySQL do better?
  - Highly structured data doing ad hoc queries
- DBPedia (http://dbpedia.org/About)
- I/O to the disk is the biggest problem for performance
- Twitter looked at Neo4j and 4B was not enough
- Uptake is a lot of web stuff
  - Can talk about web stuff
  - But can't talk about the banks and finance
  - Intelligence community might be interested in this
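
The Q&A advice about wrapping the embedded database behind a coarse-grained service, instead of exposing chatty node-level calls, might look something like the sketch below. It is plain Python with a dictionary standing in for the embedded graph store; `FriendService` and its method names are hypothetical, loosely following the "getFriends" example from the talk and the "more than five friends" question.

```python
class FriendService:
    """Coarse-grained interface; callers never see nodes or relationships."""

    def __init__(self):
        # Stand-in for the embedded graph store: name -> set of friend names.
        self._knows = {}

    def add_friendship(self, a, b):
        self._knows.setdefault(a, set()).add(b)
        self._knows.setdefault(b, set()).add(a)

    def get_friends(self, name):
        """One call returns the whole answer, which suits a REST endpoint."""
        return sorted(self._knows.get(name, ()))

    def has_more_friends_than(self, name, threshold):
        """The counting happens next to the data, not in the remote caller."""
        return len(self._knows.get(name, ())) > threshold


service = FriendService()
service.add_friendship("Mr. Anderson", "Morpheus")
service.add_friendship("Mr. Anderson", "Trinity")
friends = service.get_friends("Mr. Anderson")
popular = service.has_more_friends_than("Mr. Anderson", 5)
```

Each method maps to a single request if it is later exposed over REST, so the many small traversal steps stay inside the process that owns the database.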