RDF Helpers
RDF Helpers is a Python package of useful functionality for writing programs with the RDFLib library. It was mainly inspired by the realization that manipulating triples always involves a lot of "boilerplate" code.
Source code: gitlab.com/somanyaircraft/rdfhelpers/
PyPI package: pypi.org/project/rdfhelpers/
The documentation for this package was earlier built using some of the common Python documentation tools and made available on readthedocs.com. We have since opted to use our own tooling, which we use for all of our Web site construction.
Installation
The package source code is available on GitLab (see the link above). It is also available on PyPI and can be installed simply via
pip install rdfhelpers
It can also be installed directly from, say, PyCharm. After installation, you can use
import rdfhelpers
in your programs.
Generally helpful functionality
This section documents many "small" functions that encapsulate functionality we found ourselves writing over and over in our RDFLib-based programs.
Parameters:
- triples: An Iterable of triples to be added.
- add_to: The graph the triples are added to. If None, a new graph will be created.
- graph_class: The class from which a new graph is created. Defaults to rdflib.Graph.
- kwargs: Passed to the graph_class constructor.
The graph into which the triples were added is returned, or None if there were no triples.
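The behavior described above can be sketched as follows. This is a hypothetical re-implementation, using a plain Python set in place of an rdflib.Graph so that the sketch is self-contained; it is not the actual rdfhelpers code:

```python
# Hypothetical sketch of the graphFrom() behavior described above; a plain
# Python set stands in for an rdflib.Graph.
def graph_from(triples, add_to=None, graph_class=set, **kwargs):
    iterator = iter(triples)
    try:
        first = next(iterator)
    except StopIteration:
        return None  # no triples: return None, per the description above
    # Use the supplied graph, or create a new one from graph_class.
    graph = add_to if add_to is not None else graph_class(**kwargs)
    graph.add(first)
    for triple in iterator:
        graph.add(triple)
    return graph
```

Note how an empty iterable short-circuits before any graph is created, matching the documented return value of None.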
"Expands" a qualified name (a qname or "curie"). Parameters:
- prefix and local_name: the qname prefix and local name, respectively.
- ns_mgr: a namespace manager, an instance of
rdflib.namespace.NamespaceManager
, used to fetch the URI corresponding to the prefix.
The expanded result is returned.
Simple accessor for a single value. Does not check whether there is more than one value. If there are no values, returns None.
Sets the value of a predicate. First removes all existing values. If value is None, does not set any values after removing.
from rdflib import Graph, BNode, Literal, RDFS
from rdfhelpers import getvalue, setvalue
g = Graph()
n = BNode()
setvalue(g, n, RDFS.label, Literal("foo"))
label = getvalue(g, n, RDFS.label)
The value of label is now the literal "foo".
Adds new statements of the form (node, predicate, value), where predicate and value are the keys and values of the parameter predicates_and_values (a dict), respectively.
Like addvalues(), except that it uses setvalue() instead of rdflib.Graph.add() to add the new (or existing) predicates and values.
Calculates the difference of two graphs. Returns a tuple of two lists of triples: the ones that would have to be added to graph1 and the ones that would have to be retracted from graph1 to make it the same as graph2.
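The semantics can be sketched like this (a hedged re-implementation over plain iterables of triples; the real function operates on rdflib graphs):

```python
# Sketch of diff() semantics: (additions, retractions) that would turn
# graph1 into graph2, computed over any iterables of triples.
def diff(graph1, graph2):
    s1, s2 = set(graph1), set(graph2)
    additions = [t for t in s2 if t not in s1]    # missing from graph1
    retractions = [t for t in s1 if t not in s2]  # surplus in graph1
    return additions, retractions
```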
Creates a new dict using the keys and values of dictionary (also a dict). The values can be mapped to new values using the Callable supplied as mapper; if none is specified, the values are not mapped, and the function effectively creates a copy of the original dictionary. The new dictionary is returned.
This is just a handy "identity function", i.e., a function that merely returns its single parameter. For some reason, Python's standard library does not provide one.
Concise Bounded Descriptions (CBDs)
This function computes the so-called Concise Bounded Description (CBD) of a given RDF resource (node). Parameters:
- source: the input graph (either an rdflib.Graph or an rdflib.SPARQLStore, or some subclass of theirs).
- target: the graph into which the CBD is inserted.
- resource: the node (in the input graph) for which we want to compute the CBD.
- context: specifies the named graph from which the CBD data is pulled; if it is None, the default graph is used (note that what this means varies based on, say, which SPARQL database is used).
- use_sparql: if False (it defaults to True), the CBD is computed recursively; otherwise a non-recursive SPARQL query is used (SPARQL does not support recursive queries). The query probes the graph four levels deep, incrementally; this means that it may fail to compute the full CBD and leave blank nodes as the "leaves" of the CBD tree.
- use_describe: if True (it defaults to False), the SPARQL DESCRIBE query is used instead of the fixed-path-length query, but only if context has not been specified.
Note: Some SPARQL databases can compute a CBD via the DESCRIBE query. The unfortunate limitation of DESCRIBE is that it cannot be restricted to specific named graphs.
This function calls graphFrom()
to construct a graph from the triples it finds. The remaining keyword arguments are passed to the graph constructor.
Computes a CBD, but uses incoming instead of outgoing edges. The current implementation always uses SPARQL with fixed-length paths (max. 4 hops).
Container manipulation
Tests whether a URI (a string or an rdflib.URIRef instance) is an RDF container membership predicate (e.g., rdf:_1). Returns the index value if it is, or None otherwise. For example, for rdf:_5 it would return 5.
Creates a container item predicate URI with the given numerical index. Returns a string, not an rdflib.URIRef.
Given a node and a property, and assuming the value of this property is an RDF container, returns the "item statements" (e.g., [] rdf:_1 "foo") of this container, in sorted order. Parameters:
- graph: the graph containing said container.
- node: a node in the graph.
- predicate: a predicate of the node.
Similar to the function above, but returns a list of the item values only, not the statements.
Given a graph with a node, and a predicate of that node, sets the value of said predicate to a container with the items given. Parameters:
- graph: the graph on which to operate.
- node: a node in that graph.
- predicate: a predicate of that node.
- values: an Iterable of RDF terms, to be used as the items of a new container to be created and assigned as the value of the given predicate. If this is an empty list (or None), removes the container and the predicate.
- newtype: the type of the container to be created (it defaults to rdf:Seq).
Specialized graph classes
A "focused graph" is an rdflib.Graph
with a single vertex designated as a "focal point"; access operations are provided that use this vertex as the subject. Keyword arguments are passed to the method findFocus
in case the focus parameter is not provided.
Using Constructor to build a FocusedGraph instance:
from rdfhelpers import FocusedGraph, Constructor, graphFrom
from rdflib import BNode, Literal, RDFS
f = BNode() # the "focus" node
g = graphFrom(Constructor("?focus rdfs:label ?label").expand(focus=f, label=Literal("foo")),
graph_class=FocusedGraph, focus=f)
print(g.getvalue(RDFS.label))
will print
foo
RDF term (URIRef or BNode) that serves as the default subject of access operations.
Unimplemented in the base class.
Functionally the same as rdfhelpers.getvalue(self, self.focus, predicate).
Functionally the same as rdfhelpers.setvalue(self, self.focus, predicate, value).
This is a subclass of FocusedGraph and, when created, contains the CBD of focus (a URIRef), as computed from the data in data (anything the function cbd() can use). The parameter context is passed to cbd(). The other keyword arguments are passed to the superclass constructor.
Uses the diff() function to calculate the difference between the instance graph and the data from which the original CBD was pulled.
Templated queries
SPARQL does not really have a templated query mechanism. It is possible to tack a VALUES clause onto the end of read queries (SELECT, CONSTRUCT, etc.) to "pre-bind" variables; update queries do not allow this. Our templated query mechanism uses string substitution, and should thus be regarded as a convenience rather than a performance optimization.
Rather than using SPARQL variables (?var) in templates, we use variables named like this: $var. Also, unlike the "pre-binding" method, we can substitute variables with plain Python strings in addition to RDF terms; thus you can substitute anything, even longer passages of query code.
When mixed with a store class that supports the query() and update() methods, allows queries to be written as templates where the parameters to substitute take the form $foo. Anything can be substituted; this is not limited to RDF terms like the initBindings= parameter used in the overridden methods of RDFLib.
Executes a query using the Templated.query() method (see below). The keyword arguments are the same as for that method.
Executes an update query using the Templated.update() method (see below). The keyword arguments are the same as for that method.
The Templated class is the actual implementation of the templated query facility. It allows queries to be written as templates where the parameters to substitute take the form $foo. Anything can be substituted; this is not limited to RDF terms like the initBindings parameter of standard RDFLib methods. The substitutions are passed as keyword arguments; for example, to substitute $foo with the RDF literal "foobar", you would pass foo=rdflib.Literal("foobar"). The substitution rules are as follows:
- Constant RDF terms (URIRef, BNode, or Literal) are converted to their proper SPARQL syntax using the method Templated.forSPARQL().
- Strings are inserted into the template as-is; this means that you can actually pass SPARQL variables or larger query passages into the template.
- The value None is substituted with an empty string. To pass an "unbound" binding to a VALUES clause, use the string "UNDEF".
- Numbers and boolean values can be used as such, without first turning them into literals.
Takes the string template and performs substitutions using the bindings supplied as keyword arguments. The new string is returned.
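Python's standard string.Template uses the same $var syntax, so the substitution rules above can be sketched with it. This is a hypothetical re-implementation for illustration, not the actual Templated class (in particular, the real forSPARQL() handles RDF terms, which this sketch omits):

```python
from string import Template

# Rough rendering rules, following the list above (RDF terms omitted).
def for_sparql_sketch(thing):
    if thing is None:
        return ""                      # None -> empty string
    if isinstance(thing, bool):
        return "true" if thing else "false"
    if isinstance(thing, str):
        return thing                   # strings are inserted as-is
    return str(thing)                  # numbers used as such

# Substitute $-variables in a template from keyword-argument bindings.
def convert_sketch(template, **bindings):
    substitutions = {k: for_sparql_sketch(v) for k, v in bindings.items()}
    return Template(template).safe_substitute(substitutions)
```

For example, convert_sketch("VALUES ?x { $x }", x=42) yields "VALUES ?x { 42 }", and passing a string such as "?y" splices it into the query verbatim.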
Calls graph.query() with a query that results from converting the template string using the method Templated.convert().
Calls graph.update() with an update query that results from converting the template string using the method Templated.convert().
Converts thing into a SPARQL-compatible representation. See above for the conversion rules.
"Wraps" a graph so that query()
and update()
methods will perform template expansion.
Composable graph operations
We introduce two classes for instantiating RDF data, namely Constructor and Composable. The former lets you use SPARQL CONSTRUCT triple patterns to construct new RDF triples; the latter makes it possible to "cascade" CONSTRUCT queries and other RDF-generating actions in a "fluent" style.
Fluent API for composing graph construction
A Composable instance is an object whose methods for manipulating its contained graph always return either the instance itself or a new Composable instance, enabling "fluent"-style method chaining. The class supports most methods of rdflib.Graph and can thus be used in much the same way.
Allows Composable instances to be combined (set union of graphs).
Adds triples to the current instance.
Binds a namespace declaration to the current instance.
Creates a graph by parsing data (using the rdflib.Graph.parse() method).
Serializes the current instance using the rdflib.Graph.serialize() method.
Creates a new Composable instance from the results of a SPARQL CONSTRUCT query applied against the current instance.
Holds the current "result" of the graph construction process (an rdflib.Graph instance). This property is read-only.
Imitating CONSTRUCT queries
Note: at the moment, this feature is considered "experimental" and the API may still change considerably.
This class is capable of parsing and instantiating SPARQL CONSTRUCT triple patterns. It is intended to be a shortcut for situations where a full CONSTRUCT query might be too heavy. A Constructor instance retains the parsed result from template (a string) and can be reused without any additional parsing. The parameter ns_mgr can be used to supply a properly initialized namespace manager (if one is not supplied, the class has one that gets reused).
Simple example:
from rdflib import URIRef
from rdfhelpers import Constructor
Constructor("?x a rdfs:Class").expand(x=URIRef("http://o.ra/Class1"))
returns a generator of triples; the single triple yielded in this case (serialized as Turtle) would be:
<http://o.ra/Class1> rdf:type rdfs:Class .
The Constructor class is capable of parsing any CONSTRUCT triple patterns the normal RDFLib SPARQL parser can.
Instantiates the patterns held by a Constructor instance, and returns a generator of triples. Variable bindings can be supplied as normal keyword arguments. Any triple that uses a variable for which there is no binding (or whose binding's value is None) is left out of the result set. If the bound value of a variable is a list (or other iterable), the corresponding triple is instantiated for each of the items; if a triple has more than one variable with iterable values, the result is like a cross-product.
Allows new namespaces to be bound to the namespace manager instance held by the class (by the class itself, that is, not by individual instances). The parameters are the same as for the standard RDFLib method.
Takes a CONSTRUCT template (a string), parses it, and returns a list of triples in which some terms may be instances of rdflib.term.Variable. A bespoke namespace manager can be supplied.
Expands (i.e., instantiates) a parsed template, supplied as a list of triples. The parameter bindings must be a dict.
This class method can be used to parse and instantiate a template without having to create a Constructor
instance. When using this method, the above example would look like this:
from rdflib import URIRef
from rdfhelpers import Constructor
Constructor.parseAndExpand("?x a rdfs:Class",
{"x": URIRef("http://o.ra/Class1")})
Producers
Note: This section is still incomplete.
"Producers" are functionality (packaged as classes) that "produce" RDF data from some other data via a transformation process. Generally speaking, producers are divided into three categories:
- Harvesters: these crawl through data and "harvest" details that can be turned into RDF.
- Mappers: these "map" data from one representation into another (in this case, into RDF).
- Generators: anything that does not really fit in the above two categories.
The categorization is not absolute but merely serves the purpose of making some sense of the large number of different producer classes we have. In broad strokes, using a producer requires two calls:
- A call to the class constructor (with specific initialization parameters; see below), and
- Any number of calls to the producer's run() method (again, with specific parameters).
All producers are built on a general framework of a base class that implements a small number of methods, and generally operate on Composable instances.
Base classes
- agent: used to specify the "agent" that PROV-O needs (it defaults to prov:Agent).
- data_class: the Python class from which the producer's data is instantiated (this defaults to Composable, but any subclass thereof is also acceptable).
Producer.run() methods take the following parameters:
- activity: used to specify the "activity" that PROV-O needs (it defaults to prov:Activity).
- provenance: if True (the default), causes the producer to record provenance after the data has been generated.
This method calls three other methods (initialize, produce, and cleanup) roughly as follows:
def run(self, activity=None, provenance=True, **kwargs) -> Composable:  # Producer.run
    return self.cleanup(self.produce(self.initialize(**kwargs), **kwargs), **kwargs)
That is, the initialize() method creates a Composable instance; it gets passed to produce(), which "fills" it; and it gets further passed to cleanup(), which can "clean up" the data if needed. Finally, the data is returned as the value of run(). Each of the methods can act on the data as it sees fit, using the facilities provided by the fluent interface mechanism of the class Composable. Only the produce() method is abstract and must be defined by Producer subclasses.
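The flow above can be sketched with a minimal hypothetical subclass. A plain list stands in for a Composable, and only the method names from the description above are assumed:

```python
# Stand-in base class mirroring the run() flow described above.
class ProducerSketch:
    def run(self, activity=None, provenance=True, **kwargs):
        return self.cleanup(self.produce(self.initialize(**kwargs), **kwargs),
                            **kwargs)

    def initialize(self, **kwargs):
        return []                     # a plain list stands in for Composable

    def produce(self, data, **kwargs):
        raise NotImplementedError     # abstract: subclasses must define this

    def cleanup(self, data, **kwargs):
        return data                   # default: no cleanup

# A trivial subclass: produce() "fills" the data created by initialize().
class HelloProducer(ProducerSketch):
    def produce(self, data, **kwargs):
        data.append(("s", "p", "o"))
        return data
```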
Creates a Composable instance and returns it.
Harvester (it could be a directory in a file system, the URL of a Web site, etc.).
Mapper.
Mapper.
Producer. Note that Generator subclasses may introduce new keyword parameters to their __init__() and run() methods.
Mixin classes
We introduce two "mixin" classes that can be used when subclassing Producer
:
For subclasses of Producer that need to open and read a file.
- source: typically something that can be passed to the Python function open().
- source_kwargs: keyword parameters passed to open().
- kwargs: passed to the run() method next in the Method Resolution Order (MRO).
- connection_url: a string passed to whatever is the underlying database driver when a connection is created.
- user and passwd: user name and password used in opening a connection. Note that the password is stored in the class instance for its lifetime.
- kwargs: passed to the method next in the MRO.
Subclasses of Harvester
__init__() and run() are the same as in the superclass. The parameter root of the run() method is understood to be a path to the root directory to be crawled. See below for concrete subclasses.
MinimalGraphProducingFileSystemHarvester. Customization happens by overriding the method addMetadata(), which gets called for each file, folder, and link (see below).
Composable instance). The default method adds file creation and modification timestamps. The parameter uri is the URI of the file/folder/link in question.
rdflib.Graph instance or the pathname of an RDF file that can be parsed into a graph; the graph is assumed to contain the concepts of a SKOS concept scheme. These concepts are matched with the raw metadata's dc:subject values (by comparing to the concepts' skos:prefLabel values). The parameter photo_class is the URI of the RDF class from which photos are instantiated, and the parameter identifier_prop is the URI of the predicate that is used to associate each photograph with an identifier (the file name by default); dc:identifier would be such a predicate, but you are free to use any predicate of your choice (sub-properties of dc:identifier are recommended).
If True (it defaults to False), causes the crawler to ignore the rules in the Web site's robots.txt file.
WebCrawler that actually collects data from the pages it crawls through, and builds a graph. Method signatures are the same.
Subclasses of Mapper
tinyrml package. Like all mappers, its constructor takes the parameter mapping, which in this case can be either an rdflib.Graph instance containing an RML mapping, or a path to an RDF-parseable file that contains the mapping. The run() method takes a parameter source, which can either be an Iterable[dict] that is given to the underlying tinyrml implementation directly, or a path to a file which can be read (by subclasses specializing this), resulting in said iterable.
TinyRMLMapper where the source parameter of the run() method is assumed to be a path to a CSV file.
TinyRMLMapper where the source parameter of the run() method is assumed to be a pandas.DataFrame instance.
PandasMapper where the source parameter of the run() method is assumed to be a path to a Microsoft Excel document.
xsltproc) on the source (assumed to be an XML file).
Consumers and API clients
Note: This section is still incomplete.
A "consumer" is an object (an instance of the class Consumer
) that takes a producer as input and then "consumes" its output via the consume()
method. This method can then do what it wants with the produced data (e.g., export it to a file).
Various API clients are provided as a means to building producers that get their data through data APIs.
Discovering and caching node labels
To build responsive user interfaces it may make sense to cache node labels that are to be displayed.
A simple, unbounded cache for labels (values of rdfs:label and sub-properties thereof). If prepopulate is true (defaults to false), the cache is seeded from the graph db using the parameter selector: this is a SPARQL pattern that selects the subjects.
Returns one of the labels of resource. In case of a cache miss, the SPARQL query provided to the constructor (the parameter label_query) is used. Prefers skos:prefLabel, then skos:altLabel, then dc:title or any sub-properties of rdfs:label.
Invalidates the cached label for resource. A subsequent call to LabelCache.getLabel() will find a new value.
Populates the cache using the query parameters provided to the class constructor.
Finds all resources whose label contains the string substring.
Calls findResources() and constructs a value that can be passed to a jQuery UI "autocomplete" widget. Note that the value has to be "JSONified" first (e.g., by passing it to flask.json.jsonify() or some equivalent function).
Retrieves all labels and identifiers of resource (including its URI and possible qname). If exclude_default_namespace is true (the default), qnames in the default namespace are excluded.
Subclass of LabelCache for SKOS concept labels. The parameter scheme is the identifier of a SKOS concept scheme (an instance of skos:ConceptScheme). All the concepts of this scheme are used to prepopulate the cache.
Finds the concept (an instance of skos:Concept) that has exactly the provided label as its skos:prefLabel. If labels are unique, the following holds:
cache.getConcept(cache.getLabel(resource)) == resource
Produces a list (actually a generator) of all the concepts that are cached.