What is a content repository

19 November 2009

Joint post of Henri Bergius and Michael Marth cross-posted here and here.

Web Content Repositories are more than just plain old relational databases. In fact, the requirements that arise when managing web content have led to a class of content repository implementations that are comparable on a conceptual level. During the IKS community workshop in Rome we got together to compare JCR (the Jackrabbit implementation) and Midgard's content repository. While the terminology sometimes differs, many of the underlying ideas are identical. So we came up with a list of common traits and features of our content repositories. For comparison, we also included Apache CouchDB.

So, why use a Content Repository for your application instead of the old familiar RDBMS? Repositories provide several advantages:

Common rules for data access mean that multiple applications can work with the same content without breaking consistency of the data
Signals about changes let applications know when another application using the repository modifies something, enabling collaborative data management between apps
Objects instead of SQL mean that developers can deal with data using APIs more compatible with the rest of their desktop programming environment, and without having to fear issues like SQL injection
Data model is scriptable when you use a content repository, meaning that users can easily write Python or PHP scripts to perform batch operations on their data without having to learn your storage format
Synchronization and sharing features can be implemented on the content repository level, meaning that applications gain these features without having to implement them themselves
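
To make the "signals" and "objects instead of SQL" points concrete, here is a minimal sketch in Python. It is a toy in-memory repository, not the JCR or Midgard API: the names `ToyRepository`, `ContentObject`, `subscribe` and `save` are invented for illustration. It only shows the idea that applications store objects through one shared interface and get notified when someone else changes them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
import uuid


@dataclass
class ContentObject:
    """A content item addressed by GUID instead of an SQL row (hypothetical type)."""
    type_name: str
    properties: Dict[str, str]
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyRepository:
    """Hypothetical in-memory repository; real repositories persist and validate data."""

    def __init__(self) -> None:
        self._objects: Dict[str, ContentObject] = {}
        self._listeners: List[Callable[[str, ContentObject], None]] = []

    def subscribe(self, listener: Callable[[str, ContentObject], None]) -> None:
        # Any application sharing the repository can register for change signals.
        self._listeners.append(listener)

    def save(self, obj: ContentObject) -> None:
        # Decide the event type before storing, then notify every listener.
        event = "updated" if obj.guid in self._objects else "created"
        self._objects[obj.guid] = obj
        for listener in self._listeners:
            listener(event, obj)

    def get(self, guid: str) -> ContentObject:
        return self._objects[guid]


repo = ToyRepository()
seen = []
# A second "application" that only listens for changes.
repo.subscribe(lambda event, obj: seen.append((event, obj.properties["title"])))

article = ContentObject("article", {"title": "Hello"})
repo.save(article)
article.properties["title"] = "Hello again"
repo.save(article)
print(seen)
```

In a real repository the listener would typically live in another process (for example behind DBus signals in Midgard or a JCR observer), but the contract is the same: writers go through the repository, and readers learn about changes from it.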

feature	JCR / Jackrabbit	Midgard	CouchDB
content type system	In JCR structured or unstructured nodes are supported and can be mixed at will in a content tree.	Content types are defined in MgdSchema types. All content must be stored to an MgdSchema type, but types can be extended on content instance level using the "parameter" triplets	Type-free
type hierarchy	Structured node types support inheritance of types; additional cross-cutting aspects can be added with "mixins". Node types can define allowed node types for child nodes in the content hierarchy.	MgdSchemas allow inheritance, and an extended type can be instantiated either using the extended type or the base type	Type-free
IDs	Nodes with mixin "referenceable" have a GUID. In practice the node path is often used to reference nodes.	Every object has a GUID used for referencing. Objects located in trees that have a "name" property can also be referred to using the path	All objects can be accessed via a UUID
References	Nodes can reference each other with a hard link (special property type) or a soft link (by referring to the node path)	MgdSchema types can have properties linking to other objects of the same or a different type. A link of "parentfield" type places an MgdSchema type in a tree.	No reference support built-in
content hierarchy	All content is hierarchical / in a tree	Content can exist in a tree, or independently of it, depending on the MgdSchema type definition	flat structure
interesting property types	Multi-valued (like an array), binary properties (e.g. for files), nodes have an implicit sort-order	Binary properties stored using the Midgard Attachment system	Support for binary properties
transactions	Multiple content modifications are written in transactions.	Transactions can be used optionally.
events	JCR Observers can register for content changes by path, node type and/or CRUD operation, and receive notification of changes as serialized nodes	All transactions cause both process-internal GObject signals and interprocess DBus signals	Support for one external event notification shell script
workspaces	Workspaces provide separate root trees.	No workspace support in Midgard 9.03, coming in the next version	Multiple databases within one CouchDB instance
import and export	Nodes or parts of the repository (or the whole repo) can be imported or exported in XML. 2 formats: docview for human-friendly representation, sysview including all technical aspects	Objects can be exported and imported in XML format. There are tools supporting replication via HTTP, tarballs, XMPP, and the CouchDB replication protocol	JSON serialization is the standard way of accessing the repository. CouchDB replication protocol supports full synchronization between instances
versioning	Checkin/checkout model to create new versions of nodes, optionally versions complete sub-trees, supports branching of versions.	No versioning	All versions of content are stored and accessible separately, no branching
locking	Nodes can be locked and unlocked	Objects can be locked and unlocked
object mapping	Not in standard, but implemented in Jackrabbit. Rarely used in practice.	Object mapping is the standard way of accessing the repository	All content is accessed via JSON objects
queries	In JCR1 SQL or XPath, in JCR2 also QueryBuilder.	Query Builder	JavaScript map/reduce
access control	Done on repository level, i.e. all access control is independent of application. In Jackrabbit: pluggable authentication/authorization handlers.	No access control in Midgard repository, usually implemented on application level. Midgard provides a user authentication API	No access control
persistence	In Jackrabbit different Persistence Managers can be plugged in (RDBMS, tar file, ...)	libgda allows storage to different RDBMS like MySQL, SQLite and Postgres	CouchDB has its own storage
architecture	Jackrabbit: library (jar), JEE resource, OSGi bundle or standalone server	Library	Erlang-based daemon
APIs	Standard: Java-based, PHP coming up. In Jackrabbit: also WebDAV and HTTP-based API	C, Objective-C, PHP, Python	HTTP+JSON
full-text search	Included in repository. In Jackrabbit: Lucene bundled	No (Solr used on application level)	Plugin for using Lucene, not installed by default
standard metadata	All nodes have access rights, jcr:primaryType and jcr:mixinTypes properties. JCR 2.0 standardizes a set of optional metadata properties.	All objects have a set of standard metadata including creator, revisor, timestamps etc	No standard properties
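
The "queries" row is the one where the three systems differ most in style. As a rough illustration of the map/reduce approach, here is a sketch in Python over a flat list of type-free documents. CouchDB views are actually written in JavaScript and run inside the database; the function names `map_doc`, `reduce_values` and `run_view`, and the sample documents, are assumptions made up for this example.

```python
from collections import defaultdict

# Flat, type-free documents, similar in spirit to JSON objects in a document store.
documents = [
    {"_id": "a1", "type": "article", "author": "henri"},
    {"_id": "a2", "type": "article", "author": "michael"},
    {"_id": "a3", "type": "comment", "author": "henri"},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    if doc.get("type") == "article":
        yield doc["author"], 1


def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Hypothetical helper: group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


articles_per_author = run_view(documents, map_doc, reduce_values)
print(articles_per_author)
```

Because there is no type system, the map function itself decides which documents count as articles. In JCR or Midgard the same question would usually be asked by querying for a node type or MgdSchema type.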
