Haven: read_dta error() "Failed to parse /Users/folder/my_data.dta: Unable to allocate memory."

4k views Asked by At

I have a relatively large .dta file, with 1280000 observations, that works fine in Stata, but I am having troubles importing it into R. The data was created with Stata 15, the data contains strL or str#, #>244 variables and cannot be saved in the Stata 12 format.

I am trying to use the haven package to import the saved data using read_dta() but it is giving me the following error message: "Failed to parse /Users/folder/my_data.dta: Unable to allocate memory." Does anyone know what might be causing this problem and how to overcome it to be able to import the data in R?

I have attempted to overcome the issue in multiple ways, but none of my attempts appears to work.

  1. First I tried to expand the memory size of my r environment using Sys.setenv('R_MAX_VSIZE'=32000000000) but the console reports the same error when I try to import the data. The problem does not appear to relate to the size of my memory in R.

  2. I tried to save the data in Stata 13 format, using saveold my_data13, version(13) in Stata, but trying to import it into R with haven still produces the same error message.

  3. I tried to use readstata13 function read.dta13(my_data13), but this regularly end up in R crashing.

The strange thing is that I am able to open the data correctly in Stata, by simply double-clicking on it.

Does anyone has any suggestion on how to address this issue? Any insight on a) the meaning of the error message and how to address it 2) alternative packages able to ope stata15 files c) approach to be able to open the data in R would be most welcome.

Thanks a lot in advance for your help

Best Regards

1

There are 1 answers

3
Álvaro A. Gutiérrez-Vargas On BEST ANSWER

I just want to wrap up everything mentioned in the comments.

From the official Stata documentation, we have the following:

strL variables can be 0 to 2-billion bytes long. strL variables are not required to be longer than 2,045 bytes. str# variables can store strings of up to 2,045 bytes, so strL and str# overlap. This overlap is comparable to the overlap of the numeric types int and float. Any number that can be stored as an int can be stored as a float. Similarly, any string that can be stored as a str#, can be stored as a strL. The reverse is not true. In addition, strL variables can hold binary strings, whereas str# variables can only hold text strings. Thus the analogy between str#/strL and int/float is exact. There will be occasions when you will want to use strL variables in preference to str# variables, just as there are occasions when you will want to use float variables in preference to int variables.

The above information makes me infer two points before I try to replicate your problem.

  1. In your case, the use of strL was not necessary.

  2. The use of strL was the source of your problem, with a probable result some compatibility problem with the haven library.

However, after trying to replicate what you described, I arrived at a different conclusion.


Please gently consider the following code that emulates your problem.

version 16
clear all 
set obs 1
*4200 character string generated here: http://www.unit-conversion.info/texttools/random-string-generator/
gen strL str_var_in_strL = "eSw0qZcVs5DHU2GxgRo1Seo9uTwJ0MvHXyYUQidJMRWw8KW1310Ec242O6D4xrLziO4c56WgluSddTy0Q64QapkwGgOMZdy8ru0fyss0nwJvF4M3kBjYGF00ZsvQGYt4DjF51R3vxTzUx4xlApKwaoRADIgFlXvBh2Bug0VVhmXR3uInHDfpmID57kVWiyxX1gELdyPMVzJWizEHVx2GpjBsm1UdRphDdukFtFrnkr1HFRXBekxHkW3uOCHz0wnyDBfwitDGHosctRrWPhIjujnoalOaHkI5jbnENSNJEsOdGohoe5QKZIxtXmVbD4l8m8wLCbuSjZLw8NzU5vjPX57T2yWWasdFMIHk3kFipT0CG3dNForECS8UiW6ZWSIEmO2V62uakfrxTsRb9fIFVBUIHpGizeR0b27OnfSVB2wE2Ix0ij7kR19jz0wIh35fbwkJWqLq93pfHEtGu0FTb8H5A4XNOcR8chEAQBI7zV3rosSGnSP2h9QZtuSAcz1TrRHNMpCguvNf1DD72TCfCaiBXyflOCre7f5zchLA7k2cQ5qi4fBMVc9GnAdGB2vnjFeFlwaUD0AEUhfSJJINRQ2CKfJegqUL0jBgHBVy5cYCNxsP8Gu8NXRUo6vvyiTJMDcBkL0JKNOT4usSDi4v86cJNzQQa3ArafRzOv1RFz8BfI7pP7rXDLD6d1Z1miCqTZ8UtJBVQ0Z0eCQmTrlAvlu5busOjcAl4ZV7THH6qCV8tI53zh1THBfjnEgoPxy8UIaIK6tXDUM4RFMMd1366324mJEVwyvc5CWgzPian39Q3GFLl6zXCfD4pw7rSUmH5CNOmKgPihxPbV9NSBxiwVK3M07KFS2hZbf3ZDB4CBSJV9geFWKZlR3XNrsPudQgkpsdywNNjZTDwD2RiHF7kQAgyEW7q1w42OC2IbreBBtiPekx6yzCEWBEokLwfhrhbOnDwcnFmfKjnrxCbqypXrSnyvrUP2nUQ9vBmdxCqiVLrBHuDi6Wv2U4vyZ7dTqk84WmnwACXo5PbYY2dmhtjscLMpRw4Q6xVUEWC3qPMnQkbI1UKEq1NfOrF0X8nC0rqrwHQuNuJqHuebJj5AMXVgyZWTaqYIb4gkbGaEze3wNmHbbj1q2bmumiwd6RZRSdx7U3ZwozO9kTkZ69NHFSa2QDi8GrhgvBDMshJVaOR9K8tWcpa2QrFD7cI0ZqzneLXHXm6LsOmtZPFmikKfyts1pASGwZ8DzuWfT3j0daNmyk6y0HwHwM98KOyeuSXnQJOJzunAXkidv90hrgviWUhP70Nrx527JI7vpRr2dClBBnzO2O7YwjTdKTrmfcPs9z4iLeroo20Jg9ODHjvUYWtLRTOKrgvYAgywkj2PoVdwmuYK32UKcqH5EdhPHxWarjsqUuBb40u6nUGIQ0YS7ZuzsnVDerB8hO3rCl0FlMMYgRh4vdPcEG23JQoIwTvdujULg6Lpplyt6yK16UhVSklj6aVNIoA4zr51dULyOzWF9ZqlZz7l90QpXvLuRD9Elr5gxWvNW4fvkCAU3kpEv0s7gHS7ytjNxm0WLk74bN1iP8ZjcxXXBqwtatCo9e1Ayc59VYR9RVxtfvilb038WpHglhWEZoK91rumPSFiCJWUmlkL6P4SAbz5b6LDdW9ybiN8zdZmNtQ2px556d7DF5RRcXgLocLH37Uh6uU9cz2wmWRrcJS4rO9MkUe6KSuVjVLXSsk6J1bnvvagWl4BkY8ZPm0iBg4XXTkRAjfVgnfex1hee47b6k9c5gdS6AJSVazCPpXQJlGJ7NpyAn3hXdHkhaGtokTmne6Zag8DterOyDldPXHXwrG7PgtsmREc0VugLVPrYEbdf9QMHBtGQLwQz04Gyg2lspZ5HbGnOkfI0MTanuMN7XnWdcGBko9gmQKbpONgPqg8POcpxG2aRefswG090hvYKj5gzp3r1nitZZhBm8KDUT8P2Wy06hPxrkZinMGmBv2SIDegXr5uzceHymEnyMQZINS96QCyTiV7z1X1NZ3IBDfVPZTZ9bRxpKyMbAnYzFhx9PYSkescyMMtsGOEli1gFp2PWcqO4bpj0EnKjgWf9ae2R5nDKIkVbsNRCik3JrCM7WjHPfwdZSiA335Eyl0yoHQWjp6YJrR8ykOtw3zL2XHa2ilKIRSypG5dtDwjuqLI1fb7fB5wiG3LuowKqam8HY86aDsuu0DkpED9mAxoSvE7V6WPs5ptg31yoUOgGK1rvGtdpY7CHkaBmmv0jKNYjcZiuET6Q0If2IO36HafXJN8onjMvYadAypEY4IAxkmU2yemCFQkdDBuhr8G4DWcGO46W24QOvMghH6k3HVHeDgj5dNXKIz3rqbCC6tSNMptuQhoM2eWnwBFw6PzNIqKwB6qxbJMs9wxqvDEJlQkKgMZN4HJu1hNFpXIePLN8dSsV8xhgb1pxwFaqhLQMoYXcgobOdcrb7DDpJFbWhUXKn3WHEjO8nk4EAuNmIUdyfWxwxPPmTLEyohT7QrcjvRph82n64aRJIcyDPCho8pqtCTve84PAp4jIeechI8sl92e94jsX7XTZu3LaqDGkEEtcmp69ZqPA5Ev9NZv5ovmWNiu39kKu7QW0YnXGCvorirdScdCow4NyLgnpoAEG0FPz52oN7xU5xyTgY0x5Hel5GDsA28Qoy463tyBNuT51gQROr9XqgiM9Voq0ax1vI8QKFMLXwaLtHwPC7TkFtJbXYNmPt2kXzQ1EjLq0DGiyKd0BFMww3zpEjeSUS7KhCrf5qU7aDjIkVniPs6TGkTBOwG9ItrUv50WJfqgOd6ngHfYWzJFIAgZnGjXhtkHartO3F19iPs5VHRhTEUw2HZbgnTjmf2NmJ1onUkMNSFUMPrNkrfxl78GHNjJrGAkRU0jlXqAuIK8v6uYh4oyqSrFtzPru8GNIWCbka6LKLrMcCysoDk7VQI12xzELxUebiUsnLYCrnvmJtD0T7yv9M8H8rI4YsdbzD3forc56uvwqS0h0Bl6Sw71n781EC0R0V6067RA2TRZ39fs7yXYZ4O5pQ1uHn0qV82aZI1kHxWVJ1omu81KqoFpTnB0QuNd62AeVKRiuMiAf2UFhy40vFgFElRZFipH1TrJAuFYcgwd38kJWTGTYyW51z1DQfVZnlIegEfZQPTjnhFroayXw55MKnJGiVPQ4R0A2nO5LCDmDFmC8SyLz62pn0aQb5tvlQs5Es7woSf3SbxcKh9JndW42V8hQn4uXxbEKhAX9f48VJg5xXYsMOaBz4h0UfsOrOucFH3YVA9c7TVszXbSq7kRQkpsy3xkunDaIfdiAx5wdLE7LbPhUwrC1FWnCa2qQQxNUimxrQ35Woar9tNQSwpVd8ybEaivgQ77HPSYjTdkKX2j2TBCzmGVBesOUnWI9r34kRO5xPsPPDoJvLQe6kns75Yjcuz82OEuUai1PLVRKmzkRjyLp3tt5YDkjzuYCOWNchY3Eup1IEDvGf64wu4S1qLvRrl6HI9jZj7Li2GZc9grCTxbqpUQgCbCxdgmS6a396AJNijmG8uNnchGPlnNVm6DskG7T2pWasVuuhYhkyFNoUWuY5mBXurDMEDyyZPxlY9nlQYKHBNgg6ZnNEYnwCqTLzudDBQ48YG9r3700uvz83jJAX18s2Kjm2LlmOuPJON6rzbPua8Ac2Y0HPuQZD9Ikcim2MOyR9mbvtRTPeLAX3issevCDYaBiG6BFMaN9rW7j1UnlKQZYgTCveE4oH8tT7QGwdWENAjW4kGjS93zCS6QYxyUjg03er43KivMQOVHaT3iznZnQD3Nk5c0T9IKqRpcytY7JaRV7kmayUmKc4d1ApFqY8imlu2iTMiVfY16qMqDeulTtcKjKUuyWBrJSENwv238nWXQudShLeCsiwUMUnJvXyHdSsmAaaoG5O3RA4GkiQVAiX63tWPl6GNfweFAcpxoD4x8hZpbQ1SaBo3pRNwwHuAvzOwm0jKWndfugKqlUmPoDZb9Bx6dzDolUJtHSNYVrOACFY26SyeyeiFHnnd6wZFypkDGL4LgeBbU8TqJTjO2lFXVmCQZjTnO75V43vJKoHUrRSUkZJYTyl3c8tqWJmrUZ7lJo4cUrGzoudgzDHH8N8H73TwXboF8rbQ8i4yb9w51L0EnCd3kdmSXI0PJxuQ9CQ8AD9WKTwvEaSWDTZ"

describe

Contains data
  obs:             1                          
 vars:             1                          
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
------------------------------------------------------------------------------------

save "route\all_in_strL" , replace

Afterward, I applied your modification to the data I generated.

gen var1 = str_var_in_strL

foreach var of varlist var1 {
     generate str `var'_str = `var' 
     replace `var' = "" 
     compress `var' 
     replace `var' = `var'_str  
     drop `var'_str  
     describe `var' 
}

  obs:             1                          
 vars:             2                          16 Dec 2020 12:07
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
var1            strL    %9s                   
------------------------------------------------------------------------------------

save "route\edited_strL" , replace 

Weirdly enough, I was able to import both files into R using the haven library with the following code.

library(haven)
file <- "route_to_first_file/all_in_strL.dta"
# Import first file into R.

dta_in_R <- read_dta(
                    file,
                    encoding = NULL,
                    col_select = NULL,
                    skip = 0,
                    n_max = Inf,
                    .name_repair = "unique")



# Import edited file using your loop method into R.
file <- "route_to_edited_file/edited_strL.dta"

edited_dta_in_R<- read_dta(
                          file,
                          encoding = NULL,
                          col_select = NULL,
                          skip = 0,
                          n_max = Inf,
                          .name_repair = "unique")


The only differences here could be:

  1. The size of the dataset (just one observation in my case).
  2. The possibility that you may be using an earlier version of Stata (I am using Stata 16).

Finally, I think the source of the problem wasn't the strL type of the data, but memory available on your machine, which was probably solved by your compress step in the for loop you described.


PS: Everything was run on Win10. R version 4.0.3 (2020-10-10) and haven_2.3.1