How do I effectively stream large files while calculating MD5 and file size?

I have a service that receives files from a client server and is then supposed to upload them to my Cloudflare directory. As far as I understand, there are two streams involved here: one from the client to the service, and one from the service to Cloudflare.

It should be possible to upload files up to 15 GB, and in order to support that and avoid memory overload, I want to stream the files in small chunks rather than uploading the full file at once. This goes for both streams (Client -> service and service -> Cloudflare).

While I'm streaming the file, I also want to calculate the MD5 and the file size chunk by chunk.
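To make the goal concrete, here is a minimal sketch of what "chunk by chunk" means for the MD5 and the size. It uses only the JDK; the class name ChunkedMd5, the method md5AndSize and the chunk size are hypothetical names chosen for illustration and are not part of my service. Only one buffer of chunkSize bytes is held in memory at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes MD5 and byte count in a single pass.
class ChunkedMd5 {

    /** Holds the two values computed while reading the stream once. */
    static final class Result {
        final String md5Hex;
        final long size;

        Result(String md5Hex, long size) {
            this.md5Hex = md5Hex;
            this.size = size;
        }
    }

    static Result md5AndSize(InputStream in, int chunkSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[chunkSize]; // the only buffer kept in memory
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            md5.update(buffer, 0, read); // feed just the bytes that were read
            total += read;
        }
        return new Result(toHex(md5.digest()), total);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        Result r = md5AndSize(new ByteArrayInputStream(data), 4);
        System.out.println(r.md5Hex + " " + r.size);
    }
}
```

On its own this consumes the stream, so it cannot simply be run before the upload without reading the file twice; that is why I want to combine it with the upload stream.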

This is what I've tried so far:

I have this AwsConfig class:

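import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.google.common.base.Strings; // assuming Guava's Strings is the one in use
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.concurrent.Executors;

@Configuration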
public class AwsConfig {

    @Value("***********")
    private String accessKey;
    @Value("***********")
    private String secretKey;
    @Value("***********")
    private String endpoint;

    @Bean
    public AmazonS3 amazonS3() {
        if (Strings.isNullOrEmpty(endpoint)) throw new RuntimeException("needs s3 endpoint");
        // remove trailing slash
        var s3Url = endpoint.replaceAll("/$", "");

        var credentials = new BasicAWSCredentials(accessKey, secretKey);
        var clientConfiguration = new ClientConfiguration();
        //clientConfiguration.setSignerOverride("AWSS3V4SignerType");
        var endpointConfiguration = new AwsClientBuilder.EndpointConfiguration(s3Url, "auto");
        return AmazonS3ClientBuilder
                .standard()
                .withEndpointConfiguration(endpointConfiguration)
                .withPathStyleAccessEnabled(true)
                .withClientConfiguration(clientConfiguration)
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();
    }

    @Bean
    public TransferManager transferManager(AmazonS3 amazonS3) {
        TransferManagerBuilder builder = TransferManagerBuilder.standard()
                .withS3Client(amazonS3)
                .withMultipartUploadThreshold(50L * 1024 * 1024)  // Start multipart upload for files over 50MB
                .withExecutorFactory(() -> Executors.newFixedThreadPool(10));  // Limit the thread pool size

        return builder.build();
    }
}

Then I have this endpoint that receives the file from the client and streams it to Cloudflare:

@PostMapping("/file/upload")
@Operation(summary = "upload file")
public ResponseEntity<?> uploadFile(@RequestPart("file") MultipartFile file,
                                    @RequestPart("data") UploadUrl url) {
    try {
        if (file.isEmpty()) {
            return ResponseEntity.badRequest().body("File is empty");
        }

        String uploadUrl = url.getUrl();

        String[] uploadUrlSplit = uploadUrl.split("/");
        String bucket = uploadUrlSplit[0];
        String packageFileURL = String.join("/", Arrays.copyOfRange(uploadUrlSplit, 1, uploadUrlSplit.length));

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(file.getSize());

        // Use TransferManager to upload the file
        Upload upload = transferManager.upload(bucket, packageFileURL, file.getInputStream(), metadata);

        upload.waitForCompletion(); // Wait for the upload to complete

        return ResponseEntity.ok().body("File uploaded successfully");
    } catch (Exception e) {
        return ResponseEntity.internalServerError().body("Upload failed: " + e.getMessage());
    }
}

This works on and off: it takes 20-25 minutes to upload a 10 GB file, and sometimes it times out. I also can't figure out how to do the MD5 and file size calculation without buffering the whole file in memory.

I've tried to use TeeInputStream, but I didn't manage to get it to work. If I remove the line upload.waitForCompletion(); then a 15 GB file uploads in under 2 minutes, but the memory usage skyrockets.

First and foremost, I want the file to be streamed to the upload server just as fast as it is uploaded to the Cloudflare S3 storage, so that the client can display an accurate indication of how far along the upload is.

Secondly, I want to clone the stream into two streams (I can use TeeInputStream for this), where one stream uploads and the other calculates the MD5 using a DigestInputStream, so I don't have to store anything in memory.
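Roughly, this is the shape I have in mind, as a JDK-only sketch rather than working service code. The names HashingPassThrough, CountingInputStream and copyWithMd5 are hypothetical, and the OutputStream sink merely stands in for the upload. It assumes the consumer reads the stream exactly once from start to end; if the uploader retries or resets the stream, the digest and the count would be wrong.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: hash and count bytes as a side effect of the consumer's reads.
class HashingPassThrough {

    /** Counts the bytes that pass through it; buffers nothing. */
    static final class CountingInputStream extends FilterInputStream {
        private long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) count++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }

        long getCount() {
            return count;
        }
    }

    static final class Result {
        final String md5Hex;
        final long size;

        Result(String md5Hex, long size) {
            this.md5Hex = md5Hex;
            this.size = size;
        }
    }

    /** Copies source to sink; the sink is a stand-in for the real upload. */
    static Result copyWithMd5(InputStream source, OutputStream sink)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        CountingInputStream counting = new CountingInputStream(source);
        InputStream in = new DigestInputStream(counting, md5);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            sink.write(buffer, 0, n); // each read also updates the digest and the count
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return new Result(hex.toString(), counting.getCount());
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Result r = copyWithMd5(new ByteArrayInputStream(data), sink);
        System.out.println(r.md5Hex + " " + r.size);
    }
}
```

With this approach a second, cloned stream may not be needed at all, since the digest is updated by the same reads that feed the upload. Whether the wrapped stream can be handed to transferManager.upload in place of file.getInputStream() without breaking anything is exactly the part I'm unsure about.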

Please, how do I solve this?
