2000-11-06
RDF, squish etc
Dan Brickley, Libby Miller

This note discusses the relationship between low-level RDF APIs and SQL-like query interfaces.

During the period 1997-2000 we have seen various proposed 'graph APIs' for the RDF data model. In Java we have SiRPAC (now using the Stanford RDF API), the proposed Jena interface, and a number of others. In C, the evolving Mozilla RDF APIs and, more recently, the Redland distribution offer similar facilities. These APIs are typically couched in terms of either the 'graph' metaphor (nodes and arcs, properties and resources) or 'statements' (triples).

Regardless of terminology, we see similar functionality in most of these APIs. Typically some mechanism exists for navigating stored RDF data (e.g. the 'GetTargets(propertyname)' call in Mozilla), as well as some mechanism for requesting matches from the database given a partially completed template statement. For example, many systems allow applications to call something like a 'triplesMatching(propertyname, null, null)' method to ask for all the RDF statements that use some specified predicate.

The storage and query systems that expose such APIs exploit a wide variety of implementation strategies. RDF provides a highly general information model; as a result, a wide variety of data services can be encapsulated behind an RDF-oriented interface. There is no clear dividing line between systems that 'really' store RDF and ones which 'merely' manifest an RDF view. For example, Mozilla uses RDF as an abstraction layer to wrap a number of services that are not 'general RDF databases' (e.g. mail/news). This note focuses on general RDF stores, i.e. systems capable of representing any arbitrary RDF model, rather than those with a more constrained information model which nevertheless manifest an RDF interface.

Considering only these 'general' RDF databases, there have typically been three main categories of implementation. Firstly we have the simple in-memory database.
These systems offer graph navigation and partial statement matching against a large data structure held entirely in memory. Persistent RDF databases have most frequently been built on top of relational databases and simpler systems such as Sleepycat's BerkeleyDB.

The remainder of this section discusses various options for exposing SQL-like interfaces on top of these various kinds of 'generic' RDF database.

It should first be noted that a simple un-augmented RDF API (such as the one sketched above) is enough to build SQL-like query systems on. We know this because, in the worst case, an API that offers the ability to 'return all statements' (i.e. dump the entire database to client applications) clearly exposes enough information for external query engines to manipulate. When efficiency is not a major concern, this provides a useful backup strategy: any general RDF database that exposes a basic API can have a richer query interface wrapped around it.

For developers, this can be a big win: instead of applications having to grovel around the RDF data store using the graph metaphor, or interact with it at a highly granular statement-by-statement level, we can offer a more abstract interface. If I want to ask our Intranet for the homepages and phone numbers of staff who are working on projects that are funded by the organisation whose home page is http://www.jisc.ac.uk/, I would like to be able to write a single expression to represent that query rather than have to write a dozen lines of code to navigate the data graph 'by hand'. The existence (and success) of SQL and relational databases has conditioned developers to expect a certain degree of abstraction from databases. In the RDF world, we are only now beginning to see systems that offer such an interface.

Before discussing implementation details any further, it is worth stepping back and contrasting the RDF and SQL/RDBMS approaches.
Our goal is to find something that fills an SQL-like role, but using the RDF information model. Why is this an interesting thing to do?

Some observations on SQL and the Web:

SQL/RDBMS systems have traditionally been fragile and static

SQL-based databases, while hugely successful, have a tendency towards fragility. Unless an extremely general RDBMS schema is used (more on which later), SQL systems require application developers to decide in advance on the kinds of things their database will be able to store, and the kinds of inter-relations that hold between the entities represented in the database. While it is possible to evolve and extend RDBMS systems over time, this is rarely convenient. With predictability comes efficiency: a database that knows from the outset what it'll be doing has a reasonable chance of indexing and storing data efficiently.

Another effect of this is the tendency for SQL-based systems to entangle implementation details with abstract application-level models. If I write an application that needs to embed SQL queries, that application becomes tightly coupled not only to the abstract 'model of the world' (i.e. entities, relationships etc.) I've used, but also to the nitty-gritty detail of how those entities and relationships are stored in my particular RDBMS.

These are related problems, and they have become increasingly annoying over time. With networked computing becoming ubiquitous, databases increasingly have to talk to one another, and have to describe the _same_ entities, relationships, attributes etc. Applications need to become less tightly coupled to any particular storage/representation, i.e. avoid commitment to any particular storage strategy.

Such observations are commonplace, and the computing industry has been busy exploring alternatives and coping strategies for dealing with the shortcomings of the RDBMS approach (e.g.
encapsulating everything via Enterprise Java Bean wrappers, using Object databases, projecting out LDAP interfaces, or using the magic of XML...). Our concern here is more limited: we hope to connect these issues to some questions facing RDF implementors.

* How can we augment our basic RDF APIs with SQL-like facilities?
* How can implementations exploit meta-information about the RDF backend storage system, without being tightly coupled to any particular storage strategy?
* How can query processors and database indexes be constructed to work in the unpredictable, heterogeneous Web data environment?

Case study: Implementing an SQL-like QL on top of three different Java RDF APIs

There are a number of Java RDF APIs; here are three examples:

* RDFModelCore/RDFGraph (Dan Brickley)
* Model/ModelImpl in Jena (Brian McBride)
* Model/ModelImpl in StanfordAPI (Sergey Melnik)

All have similar features, including a triples-matching method. In RDFModelCore, for example, this is

    RDFModelCore triplesWhere(subject, predicate, object)

We had previously built an SQL-like query language implementation on top of RDFModelCore, to try to replicate some of the functionality of R.V. Guha's RDFDb, and thought it would be an interesting experiment to generalize these classes to use these different (but similar) APIs, in two ways:

* using the in-memory implementations provided with the distributions
* hacking an implementation of each that talked to a Postgres SQL backend.

Dan Brickley suggested implementing the JDBC API because it is well-used and familiar. The JDBC API is used for accessing SQL databases via a Driver, written for the particular database type, which enables a Connection object to be made. The Connection object is used to create a Statement object, which queries the database and returns a ResultSet object. The API consists of interfaces to be implemented. In this implementation many methods are not implemented, usually because RDF does not yet use things such as transactions.
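As a rough illustration of a JDBC object that deliberately leaves most methods unimplemented, the sketch below builds a java.sql.Connection in which only close() and isClosed() do anything, and every other call (commit() included) reports that it is unsupported. This is not the authors' code. The class name RdfConnectionSketch is hypothetical, the dynamic proxy is used only to keep the example short (a real driver would more likely implement the interfaces directly), and SQLFeatureNotSupportedException is a later addition to JDBC than this note.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLFeatureNotSupportedException;

public class RdfConnectionSketch {

    /** Builds a Connection that supports only close() and isClosed(). */
    public static Connection open() {
        final boolean[] closed = {false};
        InvocationHandler handler = (proxy, method, args) -> {
            switch (method.getName()) {
                case "close":
                    closed[0] = true;
                    return null;
                case "isClosed":
                    return closed[0];
                default:
                    // Transactions, metadata and so on have no RDF counterpart in this sketch.
                    throw new SQLFeatureNotSupportedException(
                            method.getName() + " is not supported by this RDF-backed sketch");
            }
        };
        return (Connection) Proxy.newProxyInstance(
                RdfConnectionSketch.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                handler);
    }

    public static void main(String[] args) throws Exception {
        Connection c = open();
        System.out.println("closed? " + c.isClosed());
        try {
            c.commit(); // a transaction call the RDF backend cannot honour
        } catch (SQLFeatureNotSupportedException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
        c.close();
        System.out.println("closed? " + c.isClosed());
    }
}
```

Throwing an explicit exception, rather than silently returning null, lets client code written against ordinary SQL drivers find out quickly which parts of the JDBC API the RDF wrapper does not cover.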
In this implementation, Statement does most of the hard work. It:

* parses the query into triple objects
* orders the query triples so that there are no dangling bindings
* makes the queries in order, one by one (unoptimised), to the underlying storage using the triplesWhere method (or similar)
* forms a java.sql.ResultSet object, which is returned.

Following the SQL model, query triples are treated as conjunctions, i.e.:

    select ?x, ?y, ?m where
    ({http://xmlns.com/foaf/0.1/livesWith} ?x ?y)
    ({http://xmlns.com/foaf/0.1/mbox} ?y ?m)
    ({http://xmlns.com/foaf/0.1/mbox} ?x mailto:daniel.brickley@bristol.ac.uk)

is interpreted as

    select ?x, ?y, ?m where
    ({http://xmlns.com/foaf/0.1/livesWith} ?x ?y) &
    ({http://xmlns.com/foaf/0.1/mbox} ?y ?m) &
    ({http://xmlns.com/foaf/0.1/mbox} ?x mailto:daniel.brickley@bristol.ac.uk)

and will not return a row unless all the slots are filled in that row.

Statement and Connection are API-independent; the only restriction on the API implementation (whether in-memory or not) is that it implements a particular form of the triplesMatching(s, p, o) method, returning an RDFModelCore object (API specific), which forms a wrapper for the API's own implementation of this method. This is implemented by making the model conform to a tiny interface. The API-specific classes are the Driver and the in-memory and SQL front end classes.

Using this JDBC API means that details of the access to the API and details of the access to the SQL store (if used) are hidden. The software as written is inefficient because it does not use any optimization to access SQL databases. But it does show that it is easy (using this implementation or something similar) to turn a basic RDF API into an SQL-like query language.
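The Statement steps described above (order the query triples, then match them one at a time against a triplesWhere-style method) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The names ConjunctiveQuerySketch, TripleSource and MemoryStore, the {subject, predicate, object} array layout, the greedy ordering heuristic and the sample data are all assumptions made for this example. Query parsing is omitted, and rows come back as plain maps rather than as a java.sql.ResultSet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ConjunctiveQuerySketch {

    /** Hypothetical "tiny interface": null is a wildcard; triples are {subject, predicate, object}. */
    public interface TripleSource {
        List<String[]> triplesWhere(String subject, String predicate, String object);
    }

    /** Simplest possible in-memory store: a linear scan over a list. */
    public static class MemoryStore implements TripleSource {
        private final List<String[]> triples = new ArrayList<>();

        public MemoryStore add(String s, String p, String o) {
            triples.add(new String[] {s, p, o});
            return this;
        }

        @Override
        public List<String[]> triplesWhere(String s, String p, String o) {
            List<String[]> out = new ArrayList<>();
            for (String[] t : triples) {
                if ((s == null || s.equals(t[0]))
                        && (p == null || p.equals(t[1]))
                        && (o == null || o.equals(t[2]))) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    private static boolean isVar(String term) {
        return term.startsWith("?");
    }

    /** Greedy ordering (an assumption): take the pattern with the most constant or already-bound slots. */
    static List<String[]> order(List<String[]> patterns) {
        List<String[]> todo = new ArrayList<>(patterns);
        List<String[]> ordered = new ArrayList<>();
        Set<String> bound = new HashSet<>();
        while (!todo.isEmpty()) {
            String[] best = null;
            int bestScore = -1;
            for (String[] p : todo) {
                int score = 0;
                for (String term : p) {
                    if (!isVar(term) || bound.contains(term)) score++;
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = p;
                }
            }
            todo.remove(best);
            ordered.add(best);
            for (String term : best) {
                if (isVar(term)) bound.add(term);
            }
        }
        return ordered;
    }

    /** Evaluates the patterns as a conjunction; each row maps every variable to a value. */
    public static List<Map<String, String>> select(TripleSource source, List<String[]> patterns) {
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(new HashMap<>()); // one empty row to start the join
        for (String[] pattern : order(patterns)) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> row : rows) {
                // Bound variables become constants; unbound ones stay null (wildcard).
                String[] probe = new String[3];
                for (int i = 0; i < 3; i++) {
                    probe[i] = isVar(pattern[i]) ? row.get(pattern[i]) : pattern[i];
                }
                for (String[] t : source.triplesWhere(probe[0], probe[1], probe[2])) {
                    Map<String, String> extended = new HashMap<>(row);
                    boolean consistent = true;
                    for (int i = 0; i < 3 && consistent; i++) {
                        if (isVar(pattern[i])) {
                            String previous = extended.putIfAbsent(pattern[i], t[i]);
                            consistent = previous == null || previous.equals(t[i]);
                        }
                    }
                    if (consistent) next.add(extended);
                }
            }
            rows = next; // rows with no match for this pattern are dropped
        }
        return rows;
    }
}
```

Because a row survives only if every pattern finds a match, the result has the "all slots filled" behaviour described for the example query. Each pattern costs one triplesWhere call per surviving row, which is the unoptimised one-by-one strategy the text mentions.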