Make a shallow copy in data.table


I read in an SO topic an answer from Matt Dowle about a shallow function to make shallow copies in data.table. However, I can't find the topic again.

data.table does not have any exported function called shallow. There is an internal one, but it is not documented. Can I use it safely? What is its behavior?

What I would like to do is make a memory-efficient copy of a big table. Let DT be a big table with n columns and f a function which adds a column in a memory-efficient way. Is something like this possible?

DT2 = f(DT)

with DT2 being a data.table whose n columns point to the original addresses (no deep copies), plus an extra column that exists only in DT2. If yes, what happens to DT if I do DT2[, col3 := NULL]?


There is 1 answer

Matt Dowle (best answer)

No, you can't use data.table:::shallow safely. It is deliberately not exported and not meant for user use: there is no guarantee that it will keep working the same way, or that its name or arguments won't change in future.

Having said this, you could decide to use it as long as either i) you can guarantee that := or set* won't be called on the result, by you or by your users (if you're creating a package), or ii) you're ok with both objects being changed by reference if := or set* is called on the result. When shallow is used internally by data.table, that's what we promise ourselves.
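To make those two conditions concrete, here is a minimal sketch of what sharing columns implies. It assumes the internal data.table:::shallow can still be called with a single data.table argument, which is not guaranteed since it is unexported; the table and the column names a, b and extra are made up for illustration.

```r
library(data.table)

DT <- data.table(a = 1:3, b = c("x", "y", "z"))

# INTERNAL and unexported: its name, arguments and behaviour may change.
# Assumption: it accepts a data.table and returns a shallow copy of it.
DT2 <- data.table:::shallow(DT)

# Adding a column only touches DT2's own list of column pointers.
DT2[, extra := a * 2L]
"extra" %in% names(DT)   # FALSE

# Sub-assigning by reference writes into the column vector that DT and
# DT2 share, so the original table changes as well.
set(DT2, i = 1L, j = "a", value = 100L)
DT$a[1]                  # 100
```

The second half is the hazard described above: once := or set* modifies an existing column of the result in place, the original is modified too. copy(DT) avoids this, at the cost of duplicating every column.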

More background is in this answer from a few days ago: https://stackoverflow.com/a/45891502/403310

In that question I asked for the bigger picture: why is this needed? Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count.

In your question you alluded to your bigger picture and that is very useful. So you'd like to create a function which adds working columns to a big data.table inside the function but doesn't change the big data.table. Can you explain more why you'd like to create a function like that? Why not load the big data.table, add the ephemeral working columns directly to it, and then proceed? Your R session is already a working copy in memory of the data, which is persistent somewhere else.
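As a rough sketch of that suggestion, using only exported data.table features: add the working column by reference, use it, then drop it again. The function name sum_large_values, the column tmp_scaled and the toy data are hypothetical.

```r
library(data.table)

DT <- data.table(id = 1:5, value = c(2.5, 3.0, 1.5, 4.0, 2.0))

# Hypothetical helper: uses a temporary working column on the table itself.
sum_large_values <- function(dt) {
  dt[, tmp_scaled := value / max(value)]      # added by reference, no copy of dt
  result <- dt[tmp_scaled > 0.5, sum(value)]  # use the working column
  dt[, tmp_scaled := NULL]                    # removed by reference
  result
}

total <- sum_large_values(DT)
total       # 9.5
names(DT)   # "id" "value"
```

The existing columns are never copied and the caller's table ends up with the same columns it started with, although it is briefly modified while the function runs.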

Note that I am not saying no. I'm not saying that you don't have a valid reason. I'm asking to discover more about that valid reason so the priority can be raised.

If that isn't the answer you had seen, there are currently 39 questions or answers returned by the search string "[data.table] shallow". Worst case, you could trawl through those to find it again.