How do I un-gzip a file without saving it?


I am new to Rust and am trying to port Go code that I wrote previously. The Go code downloaded files from S3, then gunzipped and parsed them directly, without writing to disk.

So far, the only solution I have found is to save the gzipped files to disk, then gunzip and parse them.

The ideal pipeline would gunzip and parse them directly.

How can I accomplish this?

const ENV_CRED_KEY_ID: &str = "KEY_ID";
const ENV_CRED_KEY_SECRET: &str = "KEY_SECRET";
const BUCKET_NAME: &str = "bucketname";
const REGION: &str = "us-east-1";

use anyhow::{anyhow, bail, Context, Result}; // (xp) (thiserror in prod)
use aws_sdk_s3::{config, ByteStream, Client, Credentials, Region};
use std::env;
use std::fs::{create_dir_all, File};
use std::io::{BufWriter, Write};
use std::path::Path;
use tokio_stream::StreamExt;

#[tokio::main]
async fn main() -> Result<()> {
    let client = get_aws_client(REGION)?;

    let keys = list_keys(&client, BUCKET_NAME, "CELLDATA/year=2022/month=06/day=06/").await?;
    println!("List:\n{}", keys.join("\n"));

    let dir = Path::new("input/");
    let key: &str = &keys[0];
    download_file_bytes(&client, BUCKET_NAME, key, dir).await?;
    println!("Downloaded {key} in directory {}", dir.display());

    Ok(())
}

async fn download_file_bytes(client: &Client, bucket_name: &str, key: &str, dir: &Path) -> Result<()> {
    // VALIDATE
    if !dir.is_dir() {
        bail!("Path {} is not a directory", dir.display());
    }

    // create file path and parent dir(s)
    let mut file_path = dir.join(key);
    let parent_dir = file_path
        .parent()
        .ok_or_else(|| anyhow!("Invalid parent dir for {:?}", file_path))?;
    if !parent_dir.exists() {
        create_dir_all(parent_dir)?;
    }
    file_path.set_extension("json");
    // BUILD - aws request
    let req = client.get_object().bucket(bucket_name).key(key);

    // EXECUTE
    let res = req.send().await?;

    // STREAM result to file
    let mut data: ByteStream = res.body;
    let file = File::create(&file_path)?;
    let Some(bytes)= data.try_next().await?;
    let mut gzD = GzDecoder::new(&bytes);
    let mut buf_writer = BufWriter::new( file);
    while let Some(bytes) = data.try_next().await? {
        // `write` may write only part of the chunk; `write_all` writes all of it
        buf_writer.write_all(&bytes)?;
    }
    buf_writer.flush()?;

    Ok(())
}

fn get_aws_client(region: &str) -> Result<Client> {
    // get the id/secret from env
    let key_id = env::var(ENV_CRED_KEY_ID).context("Missing KEY_ID")?;
    let key_secret = env::var(ENV_CRED_KEY_SECRET).context("Missing KEY_SECRET")?;

    // build the aws cred
    let cred = Credentials::new(key_id, key_secret, None, None, "loaded-from-custom-env");

    // build the aws client
    let region = Region::new(region.to_string());
    let conf_builder = config::Builder::new().region(region).credentials_provider(cred);
    let conf = conf_builder.build();

    // build aws client
    let client = Client::from_conf(conf);
    Ok(client)
}

1 Answer

Answered by Lucas S.

Your snippet doesn't show where GzDecoder comes from, but I'll assume it's flate2::read::GzDecoder.

flate2::read::GzDecoder is designed to wrap anything that implements std::io::Read:

  • GzDecoder::new expects an argument that implements Read => deflated data in
  • GzDecoder itself implements Read => inflated data out

Therefore, you can use it just like a BufReader: wrap your reader and use the wrapped value in its place:

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::Cursor;

fn main() {
    // Placeholder bytes; real input must be valid gzip data, or decoding fails at runtime
    let data = [0, 1, 2, 3];
    // Something that implements `std::io::Read`
    let c = Cursor::new(data);
    
    // A dummy output
    let mut out_file = File::create("/tmp/out").unwrap();

    // Using the raw data would look like this:
    // std::io::copy(&mut c, &mut out_file).unwrap();
    
    // To inflate on the fly, "pipe" the data through the decoder, i.e. wrap the reader
    let mut stream = GzDecoder::new(c);
    
    // Consume the `Read`er somehow
    std::io::copy(&mut stream, &mut out_file).unwrap();
}


You don't mention what "and parse them" entails, but the same concept applies: If your parser can read from an impl Read (e.g. it can read from a std::fs::File), then it can also read directly from a GzDecoder.