I'm currently using Datomic in one of my projects, and a question is bothering me.
Here is a simplified version of my problem:
- I need to parse a list of small English sentences and insert both the full sentence and its words into Datomic.
- the file that contains the list of sentences is quite big (> 10 GB)
- the same sentence can occur multiple times in the file, and its words can also occur multiple times across sentences
- during the insertion process, an attribute will be set to associate each sentence with its corresponding words
To ease the insertion process, I'm tempted to write the same datoms multiple times (i.e. not check whether a record already exists in the database), but I'm worried about the performance impact.
- What happens in Datomic when the same datoms are added multiple times?
- Is it worth checking that a datom has already been added prior to the transaction?
- Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?
Thank you for your help
Logically, a Datomic database is a sorted set of datoms, so adding the same datom several times is idempotent. However, when you're asserting a datom with a tempid, you may create a new datom representing the same information as an old datom. This is where `:db/unique` comes in.

To ensure an entity does not get stored several times, you want to set the `:db/unique` attribute property to `:db.unique/identity` for the right attributes. For instance, if your schema consists of 3 attributes `:word/text`, `:sentence/text`, and `:sentence/words`, then `:word/text` and `:sentence/text` should be `:db.unique/identity`.
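The corresponding schema installation transaction and an example insertion transaction could look like the following (a reconstruction: the string value types, cardinalities, and the string-tempid scheme are assumptions, only the attribute names and uniqueness come from the text above):

```clojure
;; Schema installation transaction: :word/text and :sentence/text are
;; :db.unique/identity, so re-asserting the same value upserts the
;; existing entity instead of creating a new one.
(def schema-tx
  [{:db/ident       :word/text
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :sentence/text
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :sentence/words
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/many}])

;; Insertion transaction for one sentence. Thanks to the unique
;; attributes, the string tempids resolve to existing entity ids when
;; the word or sentence has already been stored, so transacting the
;; same sentence twice is an upsert, not a duplicate.
(defn sentence-tx
  [sentence words]
  (concat
    (for [w words]
      {:db/id     (str "word-" w)
       :word/text w})
    [{:sentence/text  sentence
      :sentence/words (for [w words]
                        (str "word-" w))}]))
```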
Regarding performance:
You may not need to optimize at all, but in my view, the potential performance bottlenecks of your import process are:

1. the pressure on the transactor from uniqueness checks, which require an index lookup for each asserted unique value;
2. the time spent building the indexes.
To improve 2.: when the data you insert is sorted, indexing is faster, so an idea would be to insert words and sentences sorted. You can use Unix tools to sort large files even if they don't fit in memory. So the process would be:

1. sort the sentences, then insert them (`:sentence/text`);
2. extract the words, sort them, then insert them (`:word/text`);
3. insert the sentence-to-word relationships (`:sentence/words`).
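The sorting steps above could be done with standard Unix tools (the filenames and the word-tokenization command are illustrative); `sort` spills to temporary files on disk, so this works even when the input exceeds RAM:

```shell
# Sort and deduplicate the sentences.
sort -u sentences.txt > sentences.sorted.txt

# Extract one word per line, then sort and deduplicate the words.
tr -s '[:space:]' '\n' < sentences.txt | sort -u > words.sorted.txt
```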
To improve 1.: indeed, it could put less pressure on the transactor to use entity ids for words that are already stored instead of the whole word text (which requires an index lookup to ensure uniqueness). One idea could be to perform that lookup on the Peer, either by leveraging parallelism and/or by doing it only for frequent words. For instance, you could insert the words from the first 1000 sentences, then retrieve their entity ids and keep them in a hash map.

Personally, I would not go through these optimizations until experience has shown they're necessary.
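The Peer-side lookup idea could be sketched as follows (a sketch, not a definitive implementation: the function names are assumptions, and the nested `{:word/text w}` map relies on `:word/text` being `:db.unique/identity`):

```clojure
(require '[datomic.api :as d])

;; Build a word -> entity-id cache on the Peer by querying the words
;; already transacted (e.g. from the first 1000 sentences).
(defn word-eids
  [db]
  (into {}
        (d/q '[:find ?text ?w
               :where [?w :word/text ?text]]
             db)))

;; When building a sentence's transaction data, reference cached words
;; directly by entity id (no transactor-side index lookup needed), and
;; fall back to upserting by :word/text for words not seen yet.
(defn word-ref
  [cache w]
  (or (get cache w)
      {:word/text w}))
```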