RDF Helpers
RDF Helpers is a Python package of useful functionality for writing programs with the RDFLib library. It was mainly inspired by the realization that manipulating triples always involves a lot of "boilerplate" code.
Source code: gitlab.com/somanyaircraft/rdfhelpers/
PyPI package: pypi.org/project/rdfhelpers/
The documentation for this package was earlier built using some of the common Python documentation tools and made available on readthedocs.com. We have since opted to use our own tooling, which we use for all of our Web site construction.
Installation
The package source code is available on GitLab (see the link above). It is also available on PyPI and can be installed simply via
pip install rdfhelpers
It can also be installed directly from, say, PyCharm. After installation, you can use
import rdfhelpers
in your programs.
Generally helpful functionality
This section documents many "small" functions that encapsulate functionality we found ourselves writing over and over in our RDFLib-based programs.
Parameters:
- triples: An Iterable of triples to be added.
- add_to: The graph the triples are added to. If None, a new graph will be created.
- graph_class: The class from which a new graph is created. Defaults to rdflib.Graph.
- kwargs: Passed to the graph_class constructor.
The graph into which the triples were added is returned, or None if there were no triples.
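The behavior described above can be sketched as follows. This is a hypothetical re-implementation, using a plain Python set in place of an rdflib.Graph so that the sketch is self-contained; it is not the actual rdfhelpers code:

```python
# Hypothetical sketch of the graphFrom() behavior described above; a plain
# Python set stands in for an rdflib.Graph.
def graph_from(triples, add_to=None, graph_class=set, **kwargs):
    iterator = iter(triples)
    try:
        first = next(iterator)
    except StopIteration:
        return None  # no triples: return None, per the description above
    # Use the supplied graph, or create a new one from graph_class.
    graph = add_to if add_to is not None else graph_class(**kwargs)
    graph.add(first)
    for triple in iterator:
        graph.add(triple)
    return graph
```

Note how an empty iterable short-circuits before any graph is created, matching the documented return value of None.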
"Expands" a qualified name (a qname or "curie"). Parameters:
- prefix and local_name: the qname prefix and local name, respectively.
- ns_mgr: a namespace manager, an instance of
rdflib.namespace.NamespaceManager
, used to fetch the URI corresponding to the prefix.
The expanded result is returned.
Simple accessor for a single value. Does not check whether there is more than one value. If there are no values, returns None.
Sets the value of a predicate. First removes all existing values. If value is None, does not set any values after removing.
from rdflib import Graph, BNode, Literal, RDFS
from rdfhelpers import getvalue, setvalue
g = Graph()
n = BNode()
setvalue(g, n, RDFS.label, Literal("foo"))
label = getvalue(g, n, RDFS.label)
The value of label is now the literal "foo".
Adds new statements of the form (node, predicate, value), where predicate and value are the keys and values of the parameter predicates_and_values (a dict), respectively.
Like addvalues(), except that it uses setvalue() instead of rdflib.Graph.add() to add the new (or existing) predicates and values.
Calculates the difference of two graphs. Returns a tuple of two lists of triples: the ones that would have to be added to graph1 and the ones that would have to be retracted from graph1 to make it the same as graph2.
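The semantics can be sketched like this (a hedged re-implementation over plain iterables of triples; the real function operates on rdflib graphs):

```python
# Sketch of diff() semantics: (additions, retractions) that would turn
# graph1 into graph2, computed over any iterables of triples.
def diff(graph1, graph2):
    s1, s2 = set(graph1), set(graph2)
    additions = [t for t in s2 if t not in s1]    # missing from graph1
    retractions = [t for t in s1 if t not in s2]  # surplus in graph1
    return additions, retractions
```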
Creates a new dict using the keys and values of dictionary (also a dict). The values can be mapped to new values using the Callable supplied as mapper; if none is specified, the values are not mapped, and the function effectively creates a copy of the original dictionary. The new dictionary is returned.
This is just a handy "identity function", i.e., a function that merely returns its single parameter. For some reason, Python's standard library does not provide one.
Concise Bounded Descriptions (CBDs)
This function computes the so-called Concise Bounded Description (CBD) of a given RDF resource (node). Parameters:
- source: the input graph (either an rdflib.Graph or an rdflib.SPARQLStore, or some subclass of theirs).
- target: the graph into which the CBD is inserted.
- resource: the node (in the input graph) for which we want to compute the CBD.
- context: specifies the named graph from which the CBD data is pulled; if it is None, the default graph is used (note that what this means varies based on, say, which SPARQL database is used).
- use_sparql: if False (it defaults to True), the CBD is computed recursively; otherwise a non-recursive SPARQL query is used (SPARQL does not support recursive queries). The query probes the graph four levels deep, incrementally; this means that it may fail to compute the full CBD and leave blank nodes as the "leaves" of the CBD tree.
- use_describe: if True (it defaults to False), the SPARQL DESCRIBE query is used instead of the fixed-path-length query, but only if context has not been specified.
Note: Some SPARQL databases can compute a CBD via the DESCRIBE query. The unfortunate limitation of DESCRIBE is that it cannot be restricted to specific named graphs.
This function calls graphFrom()
to construct a graph from the triples it finds. The remaining keyword arguments are passed to the graph constructor.
Computes a CBD, but uses incoming instead of outgoing edges. The current implementation always uses SPARQL with fixed-length paths (max. 4 hops).
Container manipulation
Tests whether a URI (a string or an rdflib.URIRef instance) is an RDF container membership predicate (e.g., rdf:_1). Returns the index value if it is, or None otherwise. For example, for rdf:_5 it would return 5.
Creates a container item predicate URI with the given numerical index. Returns a string, not an rdflib.URIRef.
Given a node and a property, and assuming the value of this property is an RDF container, returns the "item statements" (e.g., [] rdf:_1 "foo") of this container, in sorted order. Parameters:
- graph: the graph containing said container.
- node: a node in the graph.
- predicate: a predicate of the node.
Similar to the function above, but returns a list of the item values only, not the statements.
Given a graph with a node, and a predicate of that node, sets the value of said predicate to a container with the items given. Parameters:
- graph: the graph on which to operate.
- node: a node in that graph.
- predicate: a predicate of that node.
- values: an Iterable of RDF terms, to be used as the items of a new container to be created and assigned as the value of the given predicate. If this is an empty list (or None), removes the container and the predicate.
- newtype: the type of the container to be created (it defaults to rdf:Seq).
Specialized graph classes
A "focused graph" is an rdflib.Graph
with a single vertex designated as a "focal point"; access operations are provided that use this vertex as the subject. Keyword arguments are passed to the method findFocus
in case the focus parameter is not provided.
Using Constructor to build a FocusedGraph instance:
from rdfhelpers import FocusedGraph, Constructor, graphFrom
from rdflib import BNode, Literal, RDFS
f = BNode() # the "focus" node
g = graphFrom(Constructor("?focus rdfs:label ?label").expand(focus=f, label=Literal("foo")),
graph_class=FocusedGraph, focus=f)
print(g.getvalue(RDFS.label))
will print
foo
RDF term (URIRef or BNode) that serves as the default subject of access operations.
Unimplemented in the base class.
Functionally the same as rdfhelpers.getvalue(self, self.focus, predicate).
Functionally the same as rdfhelpers.setvalue(self, self.focus, predicate, value).
This is a subclass of FocusedGraph and, when created, contains the CBD of focus (a URIRef), as computed from the data in data (anything the function cbd() can use). The parameter context is passed to cbd(). The other keyword arguments are passed to the superclass constructor.
Uses the diff() function to calculate the difference between the instance graph and the data from which the original CBD was pulled.
Templated queries
SPARQL does not really have a templated query mechanism. It is possible to tack a VALUES clause onto the end of read queries (SELECT, CONSTRUCT, etc.) to "pre-bind" variables; update queries do not allow this. Our templated query mechanism uses string substitution, and should thus be regarded as a convenience rather than a performance optimization.
Rather than using SPARQL variables (?var) in templates, we use variables named like this: $var. Also, unlike the "pre-binding" method, we can substitute variables with plain Python strings in addition to RDF terms; thus you can substitute anything, even longer passages of query code.
When mixed with a store class that supports the query() and update() methods, allows queries to be written as templates where the parameters to substitute take the form $foo. Anything can be substituted; this is not limited to RDF terms like the initBindings= parameter used in the overridden methods of RDFLib.
Executes a query using the Templated.query() method (see below). The keyword arguments are the same as for that method.
Executes an update query using the Templated.update() method (see below). The keyword arguments are the same as for that method.
The Templated class is the actual implementation of the templated query facility. It allows queries to be written as templates where the parameters to substitute take the form $foo. Anything can be substituted; this is not limited to RDF terms like the initBindings parameter of standard RDFLib methods. The substitutions are passed as keyword arguments; for example, to substitute $foo with the RDF literal "foobar", you would pass foo=rdflib.Literal("foobar"). The substitution rules are as follows:
- Constant RDF terms (URIRef, BNode, or Literal) are converted to their proper SPARQL syntax using the method Templated.forSPARQL().
- Strings are inserted into the template as-is; this means that you can actually pass SPARQL variables or larger query passages into the template.
- The value None is substituted with an empty string. To pass an "unbound" binding to a VALUES clause, use the string "UNDEF".
- Numbers and boolean values can be used as such, without first turning them into literals.
Takes the string template and performs substitutions using the bindings supplied as keyword arguments. The new string is returned.
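Python's standard string.Template uses the same $var syntax, so the substitution rules above can be sketched with it. This is a hypothetical re-implementation for illustration, not the actual Templated class (in particular, the real forSPARQL() handles RDF terms, which this sketch omits):

```python
from string import Template

# Rough rendering rules, following the list above (RDF terms omitted).
def for_sparql_sketch(thing):
    if thing is None:
        return ""                      # None -> empty string
    if isinstance(thing, bool):
        return "true" if thing else "false"
    if isinstance(thing, str):
        return thing                   # strings are inserted as-is
    return str(thing)                  # numbers used as such

# Substitute $-variables in a template from keyword-argument bindings.
def convert_sketch(template, **bindings):
    substitutions = {k: for_sparql_sketch(v) for k, v in bindings.items()}
    return Template(template).safe_substitute(substitutions)
```

For example, convert_sketch("VALUES ?x { $x }", x=42) yields "VALUES ?x { 42 }", and passing a string such as "?y" splices it into the query verbatim.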
Calls graph.query() with a query that results from converting the template string using the method Templated.convert().
Calls graph.update() with an update query that results from converting the template string using the method Templated.convert().
Converts thing into a SPARQL-compatible representation. See above for the conversion rules.
"Wraps" a graph so that query()
and update()
methods will perform template expansion.
Composable graph operations
We introduce two classes for instantiating RDF data, namely Constructor and Composable. The former lets you use SPARQL CONSTRUCT triple patterns to construct new RDF triples; the latter makes it possible to "cascade" CONSTRUCT queries and other RDF-generating actions in a "fluent" style.
Fluent API for composing graph construction
A Composable instance is an object whose methods for manipulating its contained graph always return either the instance itself or a new Composable instance, enabling "fluent"-style method chaining. The class supports most methods of rdflib.Graph and can thus be used in much the same way.
Allows Composable instances to be combined (set union of graphs).
Adds triples to the current instance.
Binds a namespace declaration to the current instance.
Creates a graph by parsing data (using the rdflib.Graph.parse() method).
Serializes the current instance using the rdflib.Graph.serialize() method.
Creates a new Composable instance from the results of a SPARQL CONSTRUCT query applied against the current instance.
Holds the current "result" of the graph construction process (an rdflib.Graph instance). This property is read-only.
Imitating CONSTRUCT queries
Note: at the moment, this feature is considered "experimental" and the API may still change considerably.
This class is capable of parsing and instantiating SPARQL CONSTRUCT triple patterns. It is intended to be a shortcut for situations where a full CONSTRUCT query might be too heavy. A Constructor instance retains the parsed result from template (a string) and can be reused without any additional parsing. The parameter ns_mgr can be used to supply a properly initialized namespace manager (if one is not supplied, the class has one that gets reused).
Simple example:
from rdflib import URIRef
from rdfhelpers import Constructor
Constructor("?x a rdfs:Class").expand(x=URIRef("http://o.ra/Class1"))
returns a generator of triples; the single triple yielded in this case (serialized as Turtle) would be:
<http://o.ra/Class1> rdf:type rdfs:Class .
The Constructor class is capable of parsing any CONSTRUCT triple patterns the normal RDFLib SPARQL parser can.
Instantiates the patterns held by a Constructor instance, and returns a generator of triples. Variable bindings can be supplied as normal keyword arguments. Any triple that uses a variable for which there is no binding (or whose binding's value is None) is left out of the result set. If the bound value of a variable is a list (or other iterable), the corresponding triple is instantiated for each of the items; if a triple has more than one variable with iterable values, the result is like a cross-product.
Allows new namespaces to be bound to the namespace manager instance held by the class (by the class itself, that is, not by individual instances). The parameters are the same as for the standard RDFLib method.
Takes a CONSTRUCT template (a string), parses it, and returns a list of triples in which some terms may be instances of rdflib.term.Variable. A bespoke namespace manager can be supplied.
Expands (i.e., instantiates) a parsed template, supplied as a list of triples. The parameter bindings must be a dict.
This class method can be used to parse and instantiate a template without having to create a Constructor
instance. When using this method, the above example would look like this:
from rdflib import URIRef
from rdfhelpers import Constructor
Constructor.parseAndExpand("?x a rdfs:Class",
{"x": URIRef("http://o.ra/Class1")})
Producers
Note: This section is still incomplete.
"Producers" are functionality (packaged as classes) that "produce" RDF data from some other data via a transformation process. Generally speaking, producers are divided into three categories:
- Harvesters: these crawl through data and "harvest" details that can be turned into RDF.
- Mappers: these "map" data from one representation into another (in this case, into RDF).
- Generators: anything that does not really fit in the above two categories.
The categorization is not absolute but merely serves the purpose of making some sense of the large number of different producer classes we have. In broad strokes, using a producer requires two calls:
- A call to the class constructor (with specific initialization parameters; see below), and
- Any number of calls to the producer's run() method (again, with specific parameters).
All producers are built on a general framework of a base class that implements a small number of methods, and generally operate on Composable instances.
Base classes
- agent: used to specify the "agent" that PROV-O needs (it defaults to prov:Agent).
- data_class: the Python class from which the producer's data is instantiated (this defaults to Composable, but any subclass thereof is also acceptable).
Producer.run() methods take the following parameters:
- activity: used to specify the "activity" that PROV-O needs (it defaults to prov:Activity).
- provenance: if True (the default), causes the producer to record provenance after the data has been generated.
This method calls three other methods (initialize, produce, and cleanup) roughly as follows:
def run(self, activity=None, provenance=True, **kwargs) -> Composable:  # Producer.run
    return self.cleanup(self.produce(self.initialize(**kwargs), **kwargs), **kwargs)
That is, the initialize() method creates a Composable instance; it gets passed to produce(), which "fills" it; and it gets further passed to cleanup(), which can "clean up" the data if needed. Finally, the data is returned as the value of run(). Each of the methods can act on the data as it sees fit, using the facilities provided by the fluent interface mechanism of the class Composable. Only the produce() method is abstract and must be defined by Producer subclasses.
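The flow above can be sketched with a minimal hypothetical subclass. A plain list stands in for a Composable, and only the method names from the description above are assumed:

```python
# Stand-in base class mirroring the run() flow described above.
class ProducerSketch:
    def run(self, activity=None, provenance=True, **kwargs):
        return self.cleanup(self.produce(self.initialize(**kwargs), **kwargs),
                            **kwargs)

    def initialize(self, **kwargs):
        return []                     # a plain list stands in for Composable

    def produce(self, data, **kwargs):
        raise NotImplementedError     # abstract: subclasses must define this

    def cleanup(self, data, **kwargs):
        return data                   # default: no cleanup

# A trivial subclass: produce() "fills" the data created by initialize().
class HelloProducer(ProducerSketch):
    def produce(self, data, **kwargs):
        data.append(("s", "p", "o"))
        return data
```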
Creates a Composable instance and returns it.
Harvester (it could be a directory in a file system, the URL of a Web site, etc.).
Mapper.
Mapper.
Producer. Note that Generator subclasses may introduce new keyword parameters to their __init__() and run() methods.
Mixin classes
We introduce two "mixin" classes that can be used when subclassing Producer
:
For subclasses of Producer that need to open and read a file.
- source: typically something that can be passed to the Python function open().
- source_kwargs: keyword parameters passed to open().
- kwargs: passed to the run() method next in the Method Resolution Order (MRO).
- connection_url: a string passed to whatever is the underlying database driver when a connection is created.
- user and passwd: user name and password used in opening a connection. Note that the password is stored in the class instance for its lifetime.
- kwargs: passed to the method next in the MRO.
Subclasses of Harvester
__init__() and run() are the same as in the superclass. The parameter root of the run() method is understood to be a path to the root directory to be crawled. See below for concrete subclasses.
MinimalGraphProducingFileSystemHarvester. Customization happens by overriding the method addMetadata(), which gets called for each file, folder, and link (see below).
Composable instance). The default method adds file creation and modification timestamps. The parameter uri is the URI of the file/folder/link in question.
rdflib.Graph instance or the pathname of an RDF file that can be parsed into a graph; the graph is assumed to contain the concepts of a SKOS concept scheme. These concepts are matched with the raw metadata's dc:subject values (by comparing to the concepts' skos:prefLabel values). The parameter photo_class is the URI of the RDF class from which photos are instantiated, and the parameter identifier_prop is the URI of the predicate that is used to associate each photograph with an identifier (the file name by default); dc:identifier would be such a predicate, but you are free to use any predicate of your choice (sub-properties of dc:identifier are recommended).
If True (it defaults to False), causes the crawler to ignore the rules in the Web site's robots.txt file.
WebCrawler that actually collects data from the pages it crawls through, and builds a graph. Method signatures are the same.
Subclasses of Mapper
tinyrml package. Like all mappers, its constructor takes the parameter mapping, which in this case can be either an rdflib.Graph instance containing an RML mapping, or a path to an RDF-parseable file that contains the mapping. The run() method takes a parameter source, which can either be an Iterable[dict] that is given to the underlying tinyrml implementation directly, or a path to a file which can be read (by subclasses specializing this), resulting in said iterable.
TinyRMLMapper where the source parameter of the run() method is assumed to be a path to a CSV file.
TinyRMLMapper where the source parameter of the run() method is assumed to be a pandas.DataFrame instance.
PandasMapper where the source parameter of the run() method is assumed to be a path to a Microsoft Excel document.
xsltproc) on the source (assumed to be an XML file).
Consumers and API clients
Note: This section is still incomplete.
A "consumer" is an object (an instance of the class Consumer
) that takes a producer as input and then "consumes" its output via the consume()
method. This method can then do what it wants with the produced data (e.g., export it to a file).
Various API clients are provided as a means to building producers that get their data through data APIs.
Discovering and caching node labels
To build responsive user interfaces it may make sense to cache node labels that are to be displayed.
A simple, unbounded cache for labels (values of rdfs:label and sub-properties thereof). If prepopulate is true (defaults to false), the cache is seeded from the graph db using the parameter selector: this is a SPARQL pattern that selects the subjects.
Returns one of the labels of resource. In case of a cache miss, the SPARQL query provided to the constructor (the parameter label_query) is used. Prefers skos:prefLabel, then skos:altLabel, then dc:title or any sub-properties of rdfs:label.
Invalidates the cached label for resource. A subsequent call to LabelCache.getLabel() will find a new value.
Populates the cache using the query parameters provided to the class constructor.
Finds all resources whose label contains the string substring.
Calls findResources() and constructs a value that can be passed to a jQuery UI "autocomplete" widget. Note that the value has to be "JSONified" first (e.g., by passing it to flask.json.jsonify() or some equivalent function).
Retrieves all labels and identifiers of resource (including its URI and possible qname). If exclude_default_namespace is true (the default), qnames in the default namespace are excluded.
Subclass of LabelCache for SKOS concept labels. The parameter scheme is the identifier of a SKOS concept scheme (an instance of skos:ConceptScheme). All the concepts of this scheme are used to prepopulate the cache.
Finds the concept (an instance of skos:Concept) that has exactly the provided label as its skos:prefLabel. If labels are unique, the following holds:
cache.getConcept(cache.getLabel(resource)) == resource
Produces a list (actually a generator) of all the concepts that are cached.