11. Data Import#
CubicWeb is designed to easily manipulate large amounts of data, and provides utilities to make imports simple.
The main entry point is cubicweb.dataimport.importer, which defines an ExtEntitiesImporter class responsible for importing data from an external source in the form of ExtEntity objects. An ExtEntity is a transitional representation of an entity to be imported into the CubicWeb instance; building this representation is usually domain-specific (e.g. dependent on the kind of data source: RDF, CSV, etc.) and is thus the responsibility of the end user.
Along with the importer, a store must be selected, which is responsible for inserting the data into the database. There exist different kinds of stores, allowing data to be inserted at different levels of the CubicWeb API and with different speed/security tradeoffs. Those keeping all the CubicWeb hooks and security checks will be slower, but possible insertion errors (bad data types, integrity errors, …) will be handled.
11.1. Example#
Consider the following schema snippet.
class Person(EntityType):
    name = String(required=True)

class knows(RelationDefinition):
    subject = 'Person'
    object = 'Person'
along with some data in a people.csv file:
# uri,name,knows
http://www.example.org/alice,Alice,
http://www.example.org/bob,Bob,http://www.example.org/alice
The following code (using a shell context) defines a function extentities_from_csv to read Person external entities from a CSV file and calls the ExtEntitiesImporter to insert the corresponding entities and relations into the CubicWeb instance.
from cubicweb.dataimport import ucsvreader, RQLObjectStore
from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter

def extentities_from_csv(fpath):
    """Yield Person ExtEntities read from `fpath` CSV file."""
    with open(fpath) as f:
        for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
            yield ExtEntity('Person', uri,
                            {'name': set([name]), 'knows': set([knows])})

extentities = extentities_from_csv('people.csv')
store = RQLObjectStore(cnx)
importer = ExtEntitiesImporter(schema, store)
importer.import_entities(extentities)
commit()
rset = cnx.execute('String N WHERE X name N, X knows Y, Y name "Alice"')
assert rset[0][0] == u'Bob', rset
11.2. Importer API#
Data import of external entities.
Main entry points:
- class cubicweb.dataimport.importer.ExtEntitiesImporter(schema, store, extid2eid=None, existing_relations=None, etypes_order_hint=(), import_log=None, raise_on_error=False)[source]#

  This class is responsible for importing external entities, i.e. instances of ExtEntity, into CubicWeb entities.

  Parameters:
  - schema: the CubicWeb instance's schema
  - store: a CubicWeb Store
  - extid2eid: optional {extid: eid} dictionary giving information on existing entities. It will be completed during import. You may want to use cwuri2eid() to build it.
  - existing_relations: optional {rtype: set((subj eid, obj eid))} mapping giving information on existing relations of a given type. You may want to use RelationMapping to build it.
  - etypes_order_hint: optional ordered iterable of entity types, giving a hint on the order in which their import should be attempted
  - import_log: optional object implementing the SimpleImportLog interface, used to record events occurring during the import
  - raise_on_error: optional boolean flag, defaulting to false, indicating whether errors should be raised or logged. You usually want them to be raised during tests but logged in production.

  Instances of this class are meant to import external entities through import_entities(), which handles a stream of ExtEntity objects. One may then plug arbitrary filters into the external entities stream, as sketched after the ExtEntity description below.
- class cubicweb.dataimport.importer.ExtEntity(etype, extid, values=None)[source]#

  Transitional representation of an entity for use in the data importer.

  An external entity has the following properties:
  - extid (external id), an identifier for the external entity,
  - etype (entity type), a string which must be the name of one entity type in the schema (e.g. 'Person', 'Animal', …),
  - values, a dictionary whose keys are attribute or relation names from the schema (e.g. 'first_name', 'friend') and whose values are sets. For attributes of type Bytes, byte strings should be inserted in values.

  For instance:

  ext_entity.extid = 'http://example.org/person/debby'
  ext_entity.etype = 'Person'
  ext_entity.values = {'first_name': set([u"Deborah", u"Debby"]),
                       'friend': set(['http://example.org/person/john'])}
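As mentioned above, arbitrary filters can be plugged into the external entities stream before it reaches import_entities(). Below is a minimal sketch of such a filter, reusing the extentities_from_csv function, schema and store from the example above; drop_nameless_persons is a hypothetical name:

def drop_nameless_persons(extentities):
    """Illustrative filter skipping Person external entities with no name."""
    for extentity in extentities:
        if extentity.etype == 'Person' and not extentity.values.get('name'):
            continue
        yield extentity

importer = ExtEntitiesImporter(schema, store)
importer.import_entities(drop_nameless_persons(extentities_from_csv('people.csv')))

Since filters are plain generators, several of them can be chained before the stream is handed to the importer.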
Utilities:
- cubicweb.dataimport.importer.cwuri2eid(cnx, etypes, source_eid=None)[source]#
Return a dictionary mapping cwuri to eid for entities of the given entity types and / or source.
- class cubicweb.dataimport.importer.RelationMapping(cnx, source=None)[source]#
Read-only mapping from relation type to set of related (subject, object) eids.
If source is specified, only returns relations implying entities from this source.
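For instance, these two utilities can be combined to make the importer aware of already imported data. The sketch below assumes entities were previously imported with their cwuri set to their external id, and reuses the schema, store and extentities_from_csv function from the example above:

from cubicweb.dataimport.importer import (ExtEntitiesImporter, RelationMapping,
                                          cwuri2eid)

# Map the cwuri of existing Person entities to their eid, and collect already
# existing relations, so nothing gets inserted twice.
extid2eid = cwuri2eid(cnx, ('Person',))
existing_relations = RelationMapping(cnx)
importer = ExtEntitiesImporter(schema, store,
                               extid2eid=extid2eid,
                               existing_relations=existing_relations)
importer.import_entities(extentities_from_csv('people.csv'))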
- cubicweb.dataimport.importer.use_extid_as_cwuri(extid2eid)[source]#

  Return a generator of ExtEntity objects that will set cwuri using the entity's extid if the entity does not exist yet and has no cwuri defined.

  extid2eid is an extid to eid dictionary coming from an ExtEntitiesImporter instance.

  Example usage:

  importer = ExtEntitiesImporter(schema, store, import_log=import_log)
  set_cwuri = use_extid_as_cwuri(importer.extid2eid)
  importer.import_entities(set_cwuri(extentities))
11.2.1. Stores#
Stores are responsible for inserting properly formatted entities and relations into the database. They have the following API:
>>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
>>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
>>> store.prepare_insert_relation(user_eid, 'in_group', group_eid)
>>> store.flush()
>>> store.commit()
>>> store.finish()
Some stores require a flush to copy data into the database, so if you want store-independent code you should call it explicitly (there may be multiple flushes during the process, or only one at the end if there is no memory issue). This is different from the commit, which validates the database transaction. Finally, the finish() method should be called in case the store requires additional work once everything is done. A store-independent import loop using these methods is sketched after the list below.
- prepare_insert_entity(<entity type>, **kwargs) -> eid: given an entity type, attributes and inlined relations, return the eid of the entity to be inserted, with no guarantee that anything has been inserted in the database,
- prepare_update_entity(<entity type>, eid, **kwargs) -> None: given an entity type and eid, promise to update the given attributes and inlined relations, with no guarantee that anything has been inserted in the database,
- prepare_insert_relation(eid_from, rtype, eid_to) -> None: indicate that a relation rtype should be added between entities with eids eid_from and eid_to. As with prepare_insert_entity(), there is no guarantee that the relation will be inserted in the database,
- flush() -> None: flush any temporary data to the database. May be called several times during an import,
- commit() -> None: commit the database transaction,
- finish() -> None: additional work to do after the import is finished.
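Putting those methods together, a store-independent import loop may look like the following sketch (the Person entity type comes from the example above; the rows iterable and the flush interval are illustrative assumptions):

def import_people(store, rows):
    """Insert Person entities from an iterable of (uri, name) tuples,
    using only the generic store API described above."""
    for count, (uri, name) in enumerate(rows, start=1):
        store.prepare_insert_entity('Person', cwuri=uri, name=name)
        if count % 1000 == 0:
            store.flush()    # flush periodically to limit memory usage
    store.flush()            # push any remaining data
    store.commit()           # validate the database transaction
    store.finish()           # let the store do its final work, if any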
- class cubicweb.dataimport.stores.NullStore[source]#

  Store that mainly describes the store API.

  It may be handy to test input data files or to measure the time taken by steps above the store (e.g. data parsing, importer, etc.): simply give a NullStore instance instead of the actual store.

Stores can also be used as context managers. If no exception is raised during the import, a final flush and the finish method are called; if something went wrong, everything is rolled back.
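As a sketch of both points, assuming the schema and the extentities_from_csv function from the example above, a NullStore may be used as a context manager to dry-run an import without writing anything to the database:

from cubicweb.dataimport.importer import ExtEntitiesImporter
from cubicweb.dataimport.stores import NullStore

# NullStore implements the store API but discards the data, which makes it
# convenient to check input files or to time the parsing/importing steps alone.
with NullStore() as store:
    importer = ExtEntitiesImporter(schema, store)
    importer.import_entities(extentities_from_csv('people.csv'))
# leaving the block without error triggers the final flush() and finish()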
- class cubicweb.dataimport.stores.RQLObjectStore(cnx)[source]#
Store that works by making RQL queries, hence with all of CubicWeb's machinery activated.
- class cubicweb.dataimport.stores.NoHookRQLObjectStore(cnx, metagen=None)[source]#

  Store that works by accessing the low-level CubicWeb source API, with all hooks deactivated. It may be given a metadata generator object to handle metadata which is usually handled by hooks.

  Arguments:
  - cnx, a connection to the repository
  - metagen, optional MetadataGenerator instance
- class cubicweb.dataimport.stores.MetadataGenerator(cnx, baseurl=None, source=None, meta_skipped=())[source]#

  Class responsible for generating standard metadata for imported entities. You may want to derive it to add application-specific metadata. This class (or a subclass) may be given either to a no-hook or to a massive store.

  Parameters:
  - cnx: connection to the repository
  - baseurl: optional base URL to be used for cwuri generation - defaults to config['base-url']
  - source: optional source to be used as cw_source for imported entities
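For example, a metadata generator with an explicit base URL (an illustrative value) may be handed to a no-hook store as follows:

from cubicweb.dataimport.stores import MetadataGenerator, NoHookRQLObjectStore

# Generate standard metadata (cwuri, etc.) without hooks, using an explicit
# base URL instead of config['base-url'].
metagen = MetadataGenerator(cnx, baseurl='http://www.example.org/')
store = NoHookRQLObjectStore(cnx, metagen)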
11.3. MassiveObjectStore#
This store relies on the SQL COPY FROM command to push data directly, rather than going through the whole CubicWeb API. For now it only works with PostgreSQL, as it requires the COPY FROM command. Anything related to CubicWeb (hooks, for instance) is bypassed. It inserts entities directly by using one PostgreSQL COPY FROM query for a set of similarly structured entities.
This store is the fastest if the table is small compared to the volume of data to insert. Indeed, it removes all indexes and constraints on the table before importing and reapplies them at the end; hence, if the existing table is small compared to the amount of data you want to insert, this store performs better than the others.
NOTE: Because inlined [1] relations are stored in the entity's table, they must be set like any other attribute of the entity. For instance:
store.prepare_insert_entity("MyEType", name="toto", favorite_email=email_address.eid)
[1] An inlined relation is a relation defined in the schema with the keyword argument inlined=True. Such a relation is inserted in the database as an attribute of the entity which is its subject.
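As an illustration, the favorite_email relation used in the note above could be declared as follows (MyEType and favorite_email are hypothetical names; EmailAddress is part of CubicWeb's base schema):

from yams.buildobjs import EntityType, RelationDefinition, String

class MyEType(EntityType):
    name = String()

class favorite_email(RelationDefinition):
    subject = 'MyEType'
    object = 'EmailAddress'
    inlined = True       # stored as a column of MyEType's table
    cardinality = '?*'   # an inlined relation links its subject to at most one object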
- class cubicweb.dataimport.massive_store.MassiveObjectStore(cnx, slave_mode=False, eids_seq_range=10000, metagen=None, drop=True)[source]#

  Store for massive import of data, with delayed insertion of metadata.

  WARNINGS:
  - This store may only be used with PostgreSQL for now, as it relies on the COPY FROM method and on specific PostgreSQL tables to get all the indexes.
  - This store can only insert relations that are not inlined (i.e., which do not have inlined=True in their definition in the schema), unless they are specified as entity attributes.

  It should be used as follows:

  store = MassiveObjectStore(cnx)
  eid_p = store.prepare_insert_entity('Person',
                                      cwuri=u'http://dbpedia.org/toto',
                                      name=u'Toto')
  eid_loc = store.prepare_insert_entity('Location',
                                        cwuri=u'http://geonames.org/11111',
                                        name=u'Somewhere')
  store.prepare_insert_relation(eid_p, 'lives_in', eid_loc)
  store.flush()
  ...
  store.commit()
  store.finish()

  Full-text indexation is not handled; you will have to reindex the relevant entity types yourself if desired.

  Create a MassiveObjectStore with the following arguments:
  - cnx, a connection to the repository
  - metagen, optional MetadataGenerator instance
  - eids_seq_range: size of the eid range reserved by the store for each batch
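The massive store can also be combined with the importer described above. Here is a sketch, reusing the schema and the extentities_from_csv function from the first example:

from cubicweb.dataimport.importer import ExtEntitiesImporter
from cubicweb.dataimport.massive_store import MassiveObjectStore

store = MassiveObjectStore(cnx)
importer = ExtEntitiesImporter(schema, store)
importer.import_entities(extentities_from_csv('people.csv'))
store.flush()
store.commit()
store.finish()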