LiDAR Depth + Vision Hand Tracking for 3D Hand Tracking


I want to use Vision 2D hand tracking input coupled with ARKit > People Occlusion > Body Segmentation With Depth, which leverages LiDAR, to get the 3D world coordinates of the tip of the index finger.

Steps I am doing:

1 - The 2D screen location of the finger tip provided by Vision works

2 - The Depth data from the CVPixelBuffer seems correct too

3 - The unprojection from 2D Screen Coordinates + Depth data to 3D World Coordinates is wrong

Ideally I would like a result similar to the LiDAR Lab app by Josh Caspersz.


Here is my code, which converts the 2D point coordinates + depth into 3D world coordinates:

// Result from Vision framework
// Coordinates top right of the screen with Y to the left, X down
indexTip = CGPoint(x: indexTipPoint.location.x * CGFloat(arView.bounds.width),
                   y: (1 - indexTipPoint.location.y) * CGFloat(arView.bounds.height))

if let segmentationBuffer: CVPixelBuffer = frame.estimatedDepthData {

    let segmentationWidth = CVPixelBufferGetWidth(segmentationBuffer)
    let segmentationHeight = CVPixelBufferGetHeight(segmentationBuffer)

    let xConverted: CGFloat = indexTip.x * CGFloat(segmentationWidth) / CGFloat(arView.bounds.width)
    let yConverted: CGFloat = indexTip.y * CGFloat(segmentationHeight) / CGFloat(arView.bounds.height)

    if let indexDepth: Float = segmentationBuffer.value(column: Int(xConverted), row: Int(yConverted)) {

        if indexDepth != 0 {
            let cameraIntrinsics = frame.camera.intrinsics

            var xrw: Float = (Float(indexTip.x) - cameraIntrinsics[2][0]) * indexDepth
            xrw = xrw / cameraIntrinsics[0][0]
            var yrw: Float = (Float(indexTip.y) - cameraIntrinsics[2][1]) * indexDepth
            yrw = yrw / cameraIntrinsics[1][1]
            let xyzw: SIMD4<Float> = SIMD4<Float>(xrw, yrw, indexDepth, 1.0)
            let vecResult = frame.camera.viewMatrix(for: .portrait) * xyzw

            resultAnchor.setPosition(SIMD3<Float>(vecResult.x, vecResult.y, vecResult.z), relativeTo: nil)
        }
    }
}

Here is a video of what it looks like when running; the result always seems to be located in a specific area in space: Video

The calculations are basically the ones from the Apple sample code Displaying a Point Cloud Using Scene Depth.


Finally, here is the full zip file if you want to try it yourself: ZIP.

Any idea what is wrong in my calculations?

1 Answer

Reality-Dev:

@oscar-falmer Yes, I wrote that answer on the Apple Developer forums and made that body-tracking package. I tried linking to them here as well, but someone deleted my answer because it was little more than links. Here is the solution, copied in full.

The Vision result comes in Vision coordinates: normalized, (0,0) Bottom-Left, (1,1) Top-Right. AVFoundation coordinates are (0,0) Top-Left, (1,1) Bottom-Right. To convert from Vision coordinates to AVFoundation coordinates, you must flip the Y-axis like so:

public extension CGPoint {
    func convertVisionToAVFoundation() -> CGPoint {
        return CGPoint(x: self.x, y: 1 - self.y)
    }
}

This AVFoundation coordinate is what needs to be used as input for indexing the depth buffer, like so:

public extension CVPixelBuffer {

    /// The input point must be in normalized AVFoundation coordinates,
    /// i.e. (0,0) is the top-left and (1,1) the bottom-right.
    func value(from point: CGPoint) -> Float? {
        let width = CVPixelBufferGetWidth(self)
        let height = CVPixelBufferGetHeight(self)
        // Clamp so a coordinate of exactly 1.0 does not index one pixel past the edge.
        let colPosition = min(Int(point.x * CGFloat(width)), width - 1)
        let rowPosition = min(Int(point.y * CGFloat(height)), height - 1)
        return value(column: colPosition, row: rowPosition)
    }

    func value(column: Int, row: Int) -> Float? {
        guard CVPixelBufferGetPixelFormatType(self) == kCVPixelFormatType_DepthFloat32 else { return nil }

        CVPixelBufferLockBaseAddress(self, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(self, .readOnly) }

        guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return nil }

        // Use bytesPerRow rather than width * stride so any row padding is respected.
        let bytesPerRow = CVPixelBufferGetBytesPerRow(self)
        let offset = row * bytesPerRow + column * MemoryLayout<Float>.stride
        return baseAddress.load(fromByteOffset: offset, as: Float.self)
    }
}

This is all that is needed to get depth for a given position from a Vision request.
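
For example, here is a minimal sketch of how the two pieces might be combined; the helper function below is mine, not part of the original answer:

import ARKit
import Vision

/// Hypothetical helper: returns the depth in meters at a Vision hand landmark.
/// `depthMap` is assumed to be a kCVPixelFormatType_DepthFloat32 buffer from the current ARFrame.
func depth(at recognizedPoint: VNRecognizedPoint, in depthMap: CVPixelBuffer) -> Float? {
    // Vision (bottom-left origin) -> AVFoundation (top-left origin), then sample the buffer.
    let avFoundationPoint = recognizedPoint.location.convertVisionToAVFoundation()
    return depthMap.value(from: avFoundationPoint)
}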

If you would like to find the position on screen for use with something such as UIKit or ARView.ray(through:), further transformation is required. The Vision request was performed on arView.session.currentFrame.capturedImage. From the documentation on ARFrame.displayTransform(for:viewportSize:):

Normalized image coordinates range from (0,0) in the upper left corner of the image to (1,1) in the lower right corner. This method creates an affine transform representing the rotation and aspect-fit crop operations necessary to adapt the camera image to the specified orientation and to the aspect ratio of the specified viewport. The affine transform does not scale to the viewport's pixel size. The capturedImage pixel buffer is the original image captured by the device camera, and thus not adjusted for device orientation or view aspect ratio.

So the image rendered on screen is a cropped version of the frame the camera captures, and a transformation is needed to go from AVFoundation coordinates to display (UIKit) coordinates:

public extension ARView {

    func convertAVFoundationToScreenSpace(_ point: CGPoint) -> CGPoint? {
        // Convert from normalized AVFoundation coordinates (0,0 top-left, 1,1 bottom-right)
        // to screen-space coordinates.
        guard
            let arFrame = session.currentFrame,
            let interfaceOrientation = window?.windowScene?.interfaceOrientation
        else { return nil }

        let transform = arFrame.displayTransform(for: interfaceOrientation, viewportSize: frame.size)
        let normalizedCenter = point.applying(transform)
        let center = normalizedCenter.applying(CGAffineTransform.identity.scaledBy(x: frame.width, y: frame.height))
        return center
    }
}

To go the opposite direction, from UIKit display coordinates to AVFoundation coordinates:

public extension ARView {

    func convertScreenSpaceToAVFoundation(_ point: CGPoint) -> CGPoint? {
        // Convert from screen-space UIKit coordinates to normalized
        // AVFoundation coordinates (0,0 top-left, 1,1 bottom-right).
        guard
            let arFrame = session.currentFrame,
            let interfaceOrientation = window?.windowScene?.interfaceOrientation
        else { return nil }

        let inverseScaleTransform = CGAffineTransform.identity.scaledBy(x: frame.width, y: frame.height).inverted()
        let invertedDisplayTransform = arFrame.displayTransform(for: interfaceOrientation, viewportSize: frame.size).inverted()
        let unScaledPoint = point.applying(inverseScaleTransform)
        let normalizedCenter = unScaledPoint.applying(invertedDisplayTransform)
        return normalizedCenter
    }
}

To get a world-space coordinate from a UIKit screen coordinate and a corresponding depth value:


    /// Get the world-space position from a UIKit screen point and a depth value.
    /// - Parameters:
    ///   - screenPosition: A CGPoint representing a point on screen in UIKit coordinates.
    ///   - depth: The depth at this coordinate, in meters.
    /// - Returns: The position in world space of this coordinate at this depth.
    private func worldPosition(screenPosition: CGPoint, depth: Float) -> simd_float3? {
        guard let rayResult = arView.ray(through: screenPosition) else { return nil }

        // rayResult.direction is a normalized (1 meter long) vector pointing in the
        // correct direction, and we want to go the length of depth along this vector.
        let worldOffset = rayResult.direction * depth
        let worldPosition = rayResult.origin + worldOffset
        return worldPosition
    }

To set the position of an entity in world space for a given point on screen:


    // `avFoundationPosition` comes from convertVisionToAVFoundation() and
    // `uiKitPosition` from convertAVFoundationToScreenSpace(_:) above.
    guard
        let currentFrame = arView.session.currentFrame,
        let sceneDepth = (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap,
        let depthAtPoint = sceneDepth.value(from: avFoundationPosition),
        let position = worldPosition(screenPosition: uiKitPosition, depth: depthAtPoint)
    else { return }

    trackedEntity.setPosition(position, relativeTo: nil)

And don't forget to set the proper frameSemantics on your ARConfiguration:

    func runNewConfig(){

        // Create a session configuration
        let configuration = ARWorldTrackingConfiguration()

        //Goes with (currentFrame.smoothedSceneDepth ?? currentFrame.sceneDepth)?.depthMap
        let frameSemantics: ARConfiguration.FrameSemantics = [.smoothedSceneDepth, .sceneDepth]

        //Goes with currentFrame.estimatedDepthData
        //let frameSemantics: ARConfiguration.FrameSemantics = .personSegmentationWithDepth


        if ARWorldTrackingConfiguration.supportsFrameSemantics(frameSemantics) {
            configuration.frameSemantics.insert(frameSemantics)
        }

        // Run the view's session

        session.run(configuration)
    }
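
For completeness, here is a rough sketch of how the Vision hand-pose request might be run against the captured camera image. This is not from the original answer, and orientation handling is deliberately left out so the returned normalized coordinates stay in the captured image's native orientation, which is what the depth buffer and displayTransform(for:viewportSize:) expect:

import ARKit
import Vision

/// Hypothetical example: detect the index fingertip in the current frame's captured image.
func detectIndexTip(in frame: ARFrame) -> VNRecognizedPoint? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1

    // Run the request on the unrotated capturedImage so the resulting Vision
    // coordinates line up with the depth buffer after the Y-flip shown above.
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage, options: [:])
    do {
        try handler.perform([request])
        guard let observation = request.results?.first else { return nil }
        let indexTip = try observation.recognizedPoint(.indexTip)
        // Ignore low-confidence detections.
        return indexTip.confidence > 0.3 ? indexTip : nil
    } catch {
        return nil
    }
}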