I am trying to write an iOS app that scans documents for processing with Apple's Vision API. The goal is for the live video feed to display what the camera sees, with the outline of any detected document highlighted and filled with a semi-transparent color, tracking as the phone moves around, until the user has it framed the way they like and takes the snapshot.
To do this I create three things: an AVCaptureVideoDataOutput to get the frame buffers for analysis, an AVCaptureVideoPreviewLayer to display the video preview, and a CAShapeLayer to show the outline/fill. The pipeline itself works: I get a frame, kick off a VNDetectDocumentSegmentationRequest in the background to quickly get a candidate quadrilateral, and dispatch a task to the main queue to update the shape layer.
However, the coordinate systems do not line up. Depending on the capture session preset, the device, or the size of the preview layer, the coordinate systems of the frame buffer and the displayed area can both change. I've tried every combination of transforms I can think of, but I haven't found the magic formula yet. Does anyone know how to make this work?
Here is some code...
I initialize the output/layers:
// Video data output that feeds frame buffers to the Vision request
detectionOutput = AVCaptureVideoDataOutput()
detectionOutput.alwaysDiscardsLateVideoFrames = true
detectionOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sampleBufferQueue"))
if let captureSession = captureSession, captureSession.canAddOutput(detectionOutput) {
    captureSession.addOutput(detectionOutput)
} else {
    print("Capture session could not be established.")
    return
}

// Preview layer that displays the live camera feed
videoPreviewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
videoPreviewLayer.frame = view.layer.bounds
videoPreviewLayer.videoGravity = .resizeAspectFill
view.layer.addSublayer(videoPreviewLayer)

// Shape layer that draws the detected document's outline
documentOverlayLayer = CAShapeLayer()
documentOverlayLayer.frame = videoPreviewLayer.bounds // bounds, not frame: it's a sublayer
documentOverlayLayer.strokeColor = UIColor.red.cgColor
documentOverlayLayer.lineWidth = 2
documentOverlayLayer.fillColor = UIColor.clear.cgColor
videoPreviewLayer.addSublayer(documentOverlayLayer)
I then capture the frame buffers like so:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    detectDocument(in: ciImage, withOrientation: exifOrientationFromDeviceOrientation())
}
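For completeness, exifOrientationFromDeviceOrientation() isn't shown above; it's the usual device-orientation-to-EXIF mapping. A minimal sketch for the back camera (the front camera would need the mirrored variants):

func exifOrientationFromDeviceOrientation() -> CGImagePropertyOrientation {
    // The back camera's sensor is natively landscape, so the buffer must be
    // reinterpreted according to how the device is currently held.
    switch UIDevice.current.orientation {
    case .portraitUpsideDown: return .left
    case .landscapeLeft: return .up
    case .landscapeRight: return .down
    case .portrait: return .right
    default: return .right // .faceUp / .unknown: assume portrait
    }
}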
Detect the document like so:
private func detectDocument(in image: CIImage, withOrientation orientation: CGImagePropertyOrientation) {
    let requestHandler = VNImageRequestHandler(ciImage: image, orientation: orientation, options: [:])
    let documentDetectionRequest = VNDetectDocumentSegmentationRequest { [weak self] request, error in
        DispatchQueue.main.async {
            guard let self = self else { return }
            guard let results = request.results as? [VNRectangleObservation],
                  let result = results.first else {
                // No results: clear the overlay
                self.detectedRectangle = nil
                self.documentOverlayLayer.path = nil
                return
            }
            if result.confidence < 0.5 {
                // Confidence too low: clear the overlay
                self.detectedRectangle = nil
                self.documentOverlayLayer.path = nil
            } else {
                self.detectedRectangle = result
                self.drawRectangle(result, inBounds: image.extent)
            }
        }
    }
    do {
        try requestHandler.perform([documentDetectionRequest])
    } catch {
        print("Document detection failed: \(error)")
    }
}
And then attempt to preview it like so:
private func drawRectangle(_ rectangle: VNRectangleObservation, inBounds: CGRect) {
    let xScale = videoPreviewLayer.frame.width * videoPreviewLayer.contentsScale
    let yScale = videoPreviewLayer.frame.height * videoPreviewLayer.contentsScale
    // Transforming Vision coordinates to UIKit coordinates.
    // HELP!!! Despite all kinds of combinations of outputRectConverted, layerRectConverted,
    // manually-created transforms, and others, I can't get the rectangles to consistently
    // line up with the image...
    let topLeft = CGPoint(x: rectangle.topLeft.x * xScale, y: (1 - rectangle.topLeft.y) * yScale)
    let topRight = CGPoint(x: rectangle.topRight.x * xScale, y: (1 - rectangle.topRight.y) * yScale)
    let bottomLeft = CGPoint(x: rectangle.bottomLeft.x * xScale, y: (1 - rectangle.bottomLeft.y) * yScale)
    let bottomRight = CGPoint(x: rectangle.bottomRight.x * xScale, y: (1 - rectangle.bottomRight.y) * yScale)
    // Create a UIBezierPath from the transformed points
    let path = UIBezierPath()
    path.move(to: topLeft)
    path.addLine(to: topRight)
    path.addLine(to: bottomRight)
    path.addLine(to: bottomLeft)
    path.close()
    DispatchQueue.main.async {
        self.documentOverlayLayer.path = path.cgPath
    }
}
Well, a day later I figured it out, with the help of Apple's "Highlighting Areas of Interest in an Image Using Saliency" sample code.
The trick was to stop converting the corner points by hand and instead compute a single CGAffineTransform whenever the layout changes, one that maps Vision's normalized, lower-left-origin coordinates straight into the preview layer's coordinate space.
To wit, here's the new view initialization code for the preview layers:
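A minimal sketch of that setup, assuming the overlay stays a sublayer of the preview layer and all geometry is deferred to layout (the fill color is just illustrative):

videoPreviewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
videoPreviewLayer.videoGravity = .resizeAspectFill
view.layer.addSublayer(videoPreviewLayer)

documentOverlayLayer = CAShapeLayer()
documentOverlayLayer.strokeColor = UIColor.red.cgColor
documentOverlayLayer.lineWidth = 2
documentOverlayLayer.fillColor = UIColor.systemBlue.withAlphaComponent(0.3).cgColor // semi-transparent fill
videoPreviewLayer.addSublayer(documentOverlayLayer)
// Note: no frames set here. viewDidLayoutSubviews() owns the geometry and
// recomputes previewTransform whenever the layout changes.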
And here's the layout updating code (with previewTransform stored as an instance variable):
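A sketch of that layout pass, assuming the preview connection's orientation matches the EXIF orientation handed to Vision. layerRectConverted(fromMetadataOutputRect:) does the heavy lifting, since it accounts for videoGravity, rotation, and mirroring:

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    videoPreviewLayer.frame = view.layer.bounds
    documentOverlayLayer.frame = videoPreviewLayer.bounds

    // Ask the preview layer where the full image (the unit rect in
    // metadata-output space) lands in its own coordinate space.
    let imageRect = videoPreviewLayer.layerRectConverted(
        fromMetadataOutputRect: CGRect(x: 0, y: 0, width: 1, height: 1))

    // Vision coordinates are normalized with a lower-left origin, so flip Y
    // while scaling/translating into the image's on-screen rect.
    // previewTransform is an instance property (CGAffineTransform).
    previewTransform = CGAffineTransform.identity
        .translatedBy(x: imageRect.minX, y: imageRect.maxY)
        .scaledBy(x: imageRect.width, y: -imageRect.height)
}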
Then when generating the preview quadrilateral, I just build the UIBezierPath from the raw normalized rectangle coordinates and call
path.apply(previewTransform)
at the end, and it lines up with the preview!
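Put together, the drawing code collapses to something like this sketch (the inBounds parameter from before is no longer needed):

private func drawRectangle(_ rectangle: VNRectangleObservation) {
    // Build the path directly from Vision's normalized corner points
    let path = UIBezierPath()
    path.move(to: rectangle.topLeft)
    path.addLine(to: rectangle.topRight)
    path.addLine(to: rectangle.bottomRight)
    path.addLine(to: rectangle.bottomLeft)
    path.close()
    // One transform maps the whole path into preview-layer coordinates
    path.apply(previewTransform)
    documentOverlayLayer.path = path.cgPath
}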