I'm currently using Datomic in one of my projects, and a question is bothering me.
Here is a simplified version of my problem:
- I need to parse a list of small English sentences and insert both the full sentence and its words into Datomic.
- the file that contains the list of sentences is quite big (> 10 GB)
- the same sentence can occur multiple times in the file, and its words can also occur multiple times across sentences
- during the insertion process, an attribute will be set to associate each sentence with its corresponding words
To ease the insertion process, I'm tempted to write the same datoms multiple times (i.e. not check whether a record already exists in the database), but I'm worried about the performance impact.
- What happens in Datomic when the same datoms are added multiple times?
- Is it worth checking that a datom has already been added prior to the transaction?
- Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?
Thank you for your help
Logically, a Datomic database is a sorted set of datoms, so adding the same datom several times is idempotent. However, when you're asserting a datom with a tempid, you may create a new datom representing the same information as an old datom. This is where `:db/unique` comes in.

To ensure an entity does not get stored several times, you want to set the `:db/unique` attribute property to `:db.unique/identity` for the right attributes. For instance, if your schema consists of 3 attributes `:word/text`, `:sentence/text`, and `:sentence/words`, then `:word/text` and `:sentence/text` should be `:db.unique/identity`.
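The corresponding schema installation transaction and an example insertion transaction could look like the following (a reconstruction: the string value types, cardinalities, and the string-tempid scheme are assumptions, only the attribute names and uniqueness come from the text above):

```clojure
;; Schema installation transaction: :word/text and :sentence/text are
;; :db.unique/identity, so re-asserting the same value upserts the
;; existing entity instead of creating a new one.
(def schema-tx
  [{:db/ident       :word/text
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :sentence/text
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :sentence/words
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/many}])

;; Insertion transaction for one sentence. Thanks to the unique
;; attributes, the string tempids resolve to existing entity ids when
;; the word or sentence has already been stored, so transacting the
;; same sentence twice is an upsert, not a duplicate.
(defn sentence-tx
  [sentence words]
  (concat
    (for [w words]
      {:db/id     (str "word-" w)
       :word/text w})
    [{:sentence/text  sentence
      :sentence/words (for [w words]
                        (str "word-" w))}]))
```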
Regarding performance:
You may not need to optimize at all, but in my view, the potential performance bottlenecks of your import process are:

1. the pressure on the transactor from uniqueness checks, which require an index lookup for each asserted unique value;
2. the time spent building the indexes.
To improve 2.: when the data you insert is sorted, indexing is faster, so an idea would be to insert words and sentences sorted. You can use Unix tools to sort large files even if they don't fit in memory. So the process would be:

1. sort the sentences, then insert them (`:sentence/text`);
2. extract the words, sort them, then insert them (`:word/text`);
3. insert the sentence-to-word relationships (`:sentence/words`).
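The sorting steps above could be done with standard Unix tools (the filenames and the word-tokenization command are illustrative); `sort` spills to temporary files on disk, so this works even when the input exceeds RAM:

```shell
# Sort and deduplicate the sentences.
sort -u sentences.txt > sentences.sorted.txt

# Extract one word per line, then sort and deduplicate the words.
tr -s '[:space:]' '\n' < sentences.txt | sort -u > words.sorted.txt
```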
To improve 1.: indeed, it could put less pressure on the transactor to use entity ids for words that are already stored instead of the whole word text (which requires an index lookup to ensure uniqueness). One idea could be to perform that lookup on the Peer, either by leveraging parallelism and/or by doing it only for frequent words. For instance, you could insert the words from the first 1000 sentences, then retrieve their entity ids and keep them in a hash map.

Personally, I would not go through these optimizations until experience has shown they're necessary.
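The Peer-side lookup idea could be sketched as follows (a sketch, not a definitive implementation: the function names are assumptions, and the nested `{:word/text w}` map relies on `:word/text` being `:db.unique/identity`):

```clojure
(require '[datomic.api :as d])

;; Build a word -> entity-id cache on the Peer by querying the words
;; already transacted (e.g. from the first 1000 sentences).
(defn word-eids
  [db]
  (into {}
        (d/q '[:find ?text ?w
               :where [?w :word/text ?text]]
             db)))

;; When building a sentence's transaction data, reference cached words
;; directly by entity id (no transactor-side index lookup needed), and
;; fall back to upserting by :word/text for words not seen yet.
(defn word-ref
  [cache w]
  (or (get cache w)
      {:word/text w}))
```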