I am trying to write an iOS app that scans documents for processing with Apple's Vision API. The goal is for the live video feed to display what the camera sees, with the outline of any detected document highlighted and filled with a semi-transparent color, tracking as the phone moves around, until the user has it framed the way they like and takes the snapshot.
To do this I create three things: an AVCaptureVideoDataOutput to get the frame buffers for analysis, an AVCaptureVideoPreviewLayer to display the video preview, and a CAShapeLayer to show the outline/fill. The pipeline itself works: I get a frame, kick off a VNDetectDocumentSegmentationRequest in the background to quickly get a candidate quadrilateral, and dispatch a task to the main queue to update the shape layer.
However, the coordinate systems do not line up. Depending on the capture session preset, the device, or the size of the preview layer, the coordinate systems of the frame buffer and the displayed area can both change. I've tried every combination of transforms I can think of, but I haven't found the magic formula yet. Does anyone know how to make this work?
Here is some code...
I initialize the output/layers:
// Video data output that feeds frame buffers to the Vision request
detectionOutput = AVCaptureVideoDataOutput()
detectionOutput.alwaysDiscardsLateVideoFrames = true
detectionOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sampleBufferQueue"))
if let captureSession = captureSession, captureSession.canAddOutput(detectionOutput) {
    captureSession.addOutput(detectionOutput)
} else {
    print("Capture session could not be established.")
    return
}

// Preview layer that displays the live camera feed
videoPreviewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
videoPreviewLayer.frame = view.layer.bounds
videoPreviewLayer.videoGravity = .resizeAspectFill
view.layer.addSublayer(videoPreviewLayer)

// Shape layer that draws the detected document's outline
documentOverlayLayer = CAShapeLayer()
documentOverlayLayer.frame = videoPreviewLayer.bounds // bounds, not frame: it's a sublayer
documentOverlayLayer.strokeColor = UIColor.red.cgColor
documentOverlayLayer.lineWidth = 2
documentOverlayLayer.fillColor = UIColor.clear.cgColor
videoPreviewLayer.addSublayer(documentOverlayLayer)
I then capture the frame buffers like so:
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    detectDocument(in: ciImage, withOrientation: exifOrientationFromDeviceOrientation())
}
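For completeness, exifOrientationFromDeviceOrientation() isn't shown above; it's the usual device-orientation-to-EXIF mapping. A minimal sketch for the back camera (the front camera would need the mirrored variants):

func exifOrientationFromDeviceOrientation() -> CGImagePropertyOrientation {
    // The back camera's sensor is natively landscape, so the buffer must be
    // reinterpreted according to how the device is currently held.
    switch UIDevice.current.orientation {
    case .portraitUpsideDown: return .left
    case .landscapeLeft: return .up
    case .landscapeRight: return .down
    case .portrait: return .right
    default: return .right // .faceUp / .unknown: assume portrait
    }
}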
Detect the document like so:
private func detectDocument(in image: CIImage, withOrientation orientation: CGImagePropertyOrientation) {
    let requestHandler = VNImageRequestHandler(ciImage: image, orientation: orientation, options: [:])
    let documentDetectionRequest = VNDetectDocumentSegmentationRequest { [weak self] request, error in
        DispatchQueue.main.async {
            guard let self = self else { return }
            guard let results = request.results as? [VNRectangleObservation],
                  let result = results.first else {
                // No results: clear the overlay
                self.detectedRectangle = nil
                self.documentOverlayLayer.path = nil
                return
            }
            if result.confidence < 0.5 {
                // Confidence too low: clear the overlay
                self.detectedRectangle = nil
                self.documentOverlayLayer.path = nil
            } else {
                self.detectedRectangle = result
                self.drawRectangle(result, inBounds: image.extent)
            }
        }
    }
    do {
        try requestHandler.perform([documentDetectionRequest])
    } catch {
        print("Document detection failed: \(error)")
    }
}
And then attempt to preview it like so:
private func drawRectangle(_ rectangle: VNRectangleObservation, inBounds: CGRect) {
    let xScale = videoPreviewLayer.frame.width * videoPreviewLayer.contentsScale
    let yScale = videoPreviewLayer.frame.height * videoPreviewLayer.contentsScale
    // Transforming Vision coordinates to UIKit coordinates.
    // HELP!!! Despite all kinds of combinations of outputRectConverted, layerRectConverted,
    // manually-created transforms, and others, I can't get the rectangles to consistently
    // line up with the image...
    let topLeft = CGPoint(x: rectangle.topLeft.x * xScale, y: (1 - rectangle.topLeft.y) * yScale)
    let topRight = CGPoint(x: rectangle.topRight.x * xScale, y: (1 - rectangle.topRight.y) * yScale)
    let bottomLeft = CGPoint(x: rectangle.bottomLeft.x * xScale, y: (1 - rectangle.bottomLeft.y) * yScale)
    let bottomRight = CGPoint(x: rectangle.bottomRight.x * xScale, y: (1 - rectangle.bottomRight.y) * yScale)
    // Create a UIBezierPath from the transformed points
    let path = UIBezierPath()
    path.move(to: topLeft)
    path.addLine(to: topRight)
    path.addLine(to: bottomRight)
    path.addLine(to: bottomLeft)
    path.close()
    DispatchQueue.main.async {
        self.documentOverlayLayer.path = path.cgPath
    }
}
Well, a day later I figured it out, with the help of Apple's "Highlighting Areas of Interest in an Image Using Saliency" sample code.
The trick was to stop converting the corner points by hand and instead compute a single CGAffineTransform whenever the layout changes, one that maps Vision's normalized, lower-left-origin coordinates straight into the preview layer's coordinate space.
To wit, here's the new view initialization code for the preview layers:
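A minimal sketch of that setup, assuming the overlay stays a sublayer of the preview layer and all geometry is deferred to layout (the fill color is just illustrative):

videoPreviewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
videoPreviewLayer.videoGravity = .resizeAspectFill
view.layer.addSublayer(videoPreviewLayer)

documentOverlayLayer = CAShapeLayer()
documentOverlayLayer.strokeColor = UIColor.red.cgColor
documentOverlayLayer.lineWidth = 2
documentOverlayLayer.fillColor = UIColor.systemBlue.withAlphaComponent(0.3).cgColor // semi-transparent fill
videoPreviewLayer.addSublayer(documentOverlayLayer)
// Note: no frames set here. viewDidLayoutSubviews() owns the geometry and
// recomputes previewTransform whenever the layout changes.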
And here's the layout updating code (with previewTransform stored as an instance variable):
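A sketch of that layout pass, assuming the preview connection's orientation matches the EXIF orientation handed to Vision. layerRectConverted(fromMetadataOutputRect:) does the heavy lifting, since it accounts for videoGravity, rotation, and mirroring:

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    videoPreviewLayer.frame = view.layer.bounds
    documentOverlayLayer.frame = videoPreviewLayer.bounds

    // Ask the preview layer where the full image (the unit rect in
    // metadata-output space) lands in its own coordinate space.
    let imageRect = videoPreviewLayer.layerRectConverted(
        fromMetadataOutputRect: CGRect(x: 0, y: 0, width: 1, height: 1))

    // Vision coordinates are normalized with a lower-left origin, so flip Y
    // while scaling/translating into the image's on-screen rect.
    // previewTransform is an instance property (CGAffineTransform).
    previewTransform = CGAffineTransform.identity
        .translatedBy(x: imageRect.minX, y: imageRect.maxY)
        .scaledBy(x: imageRect.width, y: -imageRect.height)
}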
Then when generating the preview quadrilateral, I just build the UIBezierPath from the raw normalized rectangle coordinates and call
path.apply(previewTransform)
at the end, and it lines up with the preview!
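Put together, the drawing code collapses to something like this sketch (the inBounds parameter from before is no longer needed):

private func drawRectangle(_ rectangle: VNRectangleObservation) {
    // Build the path directly from Vision's normalized corner points
    let path = UIBezierPath()
    path.move(to: rectangle.topLeft)
    path.addLine(to: rectangle.topRight)
    path.addLine(to: rectangle.bottomRight)
    path.addLine(to: rectangle.bottomLeft)
    path.close()
    // One transform maps the whole path into preview-layer coordinates
    path.apply(previewTransform)
    documentOverlayLayer.path = path.cgPath
}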