The Plasma graph query engine
The Neo4J team recently blogged about a graph query language called Cypher, and reading their post got me motivated to write about something I’ve been working on for a while. Plasma is a graph query engine written in Clojure. Currently it sits on top of the Jiraph graph database that was written by Justin Balthrop and Lance Bradley for a geneology website, Geni. (It would be less than a days work to get it running on top of another graph database though.) The query engine is built using a library of query operators that are combined to form dataflow graphs, and it uses Zach Tellman’s asynchronous events library, Lamina, to provide concurrent, non-blocking execution of queries.
Query language examples
These query examples will be running against the test graph shown below (click for a big version). This graph represents the ACME company, which has three Swiss offices in Lugano, Geneva and Zurich. Each office has a manager, and the company produces a set of components which are combined to form widgets. (Have thoughts on more interesting example graphs? Please leave them in the comments because I’d like some more.)
The full code for these examples, including the graph construction, is available in the github project here.
In the examples below I’ll be using the plasma.query namespace:
(require '[plasma.query :as query])
The functions in the query API operate on and produce a query plan, which can then be executed any number of times by passing it to the query function.
Select nodes with path expressions
; select all component nodes using a basic path expression
(query/path [:product :component])
; same query, using a regular expression for the component edge
(query/path [:product #"comp.*"])
; same query, with bindings to the product and component nodes
(query/path [product [:product]
component [product :component]]))
A path expression starts at the root node of the graph by default, but you can start from another node by passing an argument to the query function when executed or by using the starting node’s id as the first element of the path.
The elements of a path expression are edge predicates. Keyword and regexp predicates will be matched against the :label property of an edge, and a map of predicates can be used to compare multiple edge properties.
Executing the queries above will return a list of maps containing the ids of the component nodes.
(query (query/path [:product :component]))
; result:
({:id "UUID:6df19085-0342-41a8-90b9-4d23f076e29d"}
{:id "UUID:16e2c756-2f44-4414-a007-417a19ba71a0"}
{:id "UUID:16e2c756-2f44-4414-a007-417a19ba71a0"}
{:id "UUID:ae990134-acfd-40f0-aef5-413be7b1002b"}
{:id "UUID:6729a33b-6cb7-44ad-921b-9a2ed0fe2bfc"})
Project node properties
(-> (query/path [comp [:product :component]])
(query/project ['comp :label :cost]))
; result:
({:cost 24.0, :label :component-1}
{:cost 8.0, :label :component-2}
{:cost 8.0, :label :component-2}
{:cost 75.0, :label :component-3}
{:cost 120.0, :label :component-4})
You can also project multiple nodes along the path, and optionally rename properties in case of conflicts, like the :label property here.
(-> (query/path [p [:product] c [p :component]])
(query/project ['p [:label :as :product-label]] ['c :label :cost]))
; result:
({:cost 24.0, :label :component-1, :product-label :alpha}
{:cost 8.0, :label :component-2, :product-label :alpha}
{:cost 8.0, :label :component-2, :product-label :beta}
{:cost 75.0, :label :component-3, :product-label :beta}
{:cost 120.0, :label :component-4, :product-label :beta})
Filter using the where clause
; Select the components that would cost more than $500 if using a 5x markup
(-> (query/path [component [:product :component]])
(query/where (> (* 5 (:cost 'component)) 500))
(query/project ['component :label :cost]))
; result:
({:cost 120.0, :label :component-4})
Note that the where form supports arbitrary expressions involving node properties. All arithmetic functions, boolean logic, bit operations and many java.lang.Math functions (e.g. trig, log, etc..) are supported and its easy to add more.
The where filter doesn’t always have to apply to the last node of a path expression. Here is a query that will select products with components made in Geneva.
(->
(query/path [product [:product]
component [product :component]
location [component :made-in]])
(query/where (= (:label 'location) :geneva-office))
(query/distinct* 'product)
(query/project ['product :label]))
; result:
({:label :beta})
Sort with order-by, limit, and choose results
; Get the most expensive product
(-> (query/path [product [:product]])
(query/order-by 'product :price :desc)
(query/project ['product :label :price])
(query/limit 1))
; result:
({:price 600.0, :label :beta})
The choose operator can be used to select a random subset of the results.
; Get a random manager node
(-> (query/path [:location :managed-by])
(query/choose 1))
; result:
({:id "UUID:8f53c132-4f7e-4ef9-a192-2ff069d4ac61"})
and more…
In addition you can compute aggregate values over specific properties using query/min, query/max, query/avg, and query/count*. There is also a basic graph construction API to make it easier to store sub-graphs programmatically. Checkout query.clj on github for examples.
Distributing Graph Queries
The Plasma engine has also been designed to support distributed queries, where a path expression can cross from one graph to another on the local machine, or from one machine to another over the network. In a future post I’ll show an example graph that contains proxy nodes, which act as bridges to remote graphs. Distributed path expressions can be used not only to query large graph databases spread across many servers, but they can also be the basis for a new kind of distributed system. I’m not sure what the right role for such a distributed graph network is yet, but I think it has some interesting possibilities for supporting a peer-to-peer infrastructure.
Jun 12 2011
Atom Feed
