I want to use Vision 2D hand tracking input coupled with ARKit > People Occlusion > Body Segmentation With Depth, which leverages LiDAR, to get the 3D world coordinates of the tip of the index finger.
Steps I am taking:
1 - The 2D screen location of the fingertip provided by Vision works.
2 - The depth data from the CVPixelBuffer seems correct too.
3 - The unprojection from 2D screen coordinates + depth to 3D world coordinates is wrong.
Ideally I would get a result similar to the LiDAR Lab app by Josh Caspersz.
Here is my code, which converts the 2D point coordinates + depth into 3D world coordinates:
// Result from Vision framework
// Coordinates top right of the screen with Y to the left, X down
indexTip = CGPoint(x: indexTipPoint.location.x * CGFloat(arView.bounds.width),
                   y: (1 - indexTipPoint.location.y) * CGFloat(arView.bounds.height))

if let segmentationBuffer: CVPixelBuffer = frame.estimatedDepthData {

    let segmentationWidth = CVPixelBufferGetWidth(segmentationBuffer)
    let segmentationHeight = CVPixelBufferGetHeight(segmentationBuffer)

    let xConverted: CGFloat = indexTip.x * CGFloat(segmentationWidth) / CGFloat(arView.bounds.width)
    let yConverted: CGFloat = indexTip.y * CGFloat(segmentationHeight) / CGFloat(arView.bounds.height)

    if let indexDepth: Float = segmentationBuffer.value(column: Int(xConverted), row: Int(yConverted)) {

        if indexDepth != 0 {
            let cameraIntrinsics = frame.camera.intrinsics

            var xrw: Float = (Float(indexTip.x) - cameraIntrinsics[2][0]) * indexDepth
            xrw = xrw / cameraIntrinsics[0][0]
            var yrw: Float = (Float(indexTip.y) - cameraIntrinsics[2][1]) * indexDepth
            yrw = yrw / cameraIntrinsics[1][1]

            let xyzw: SIMD4<Float> = SIMD4<Float>(xrw, yrw, indexDepth, 1.0)
            let vecResult = frame.camera.viewMatrix(for: .portrait) * xyzw

            resultAnchor.setPosition(SIMD3<Float>(vecResult.x, vecResult.y, vecResult.z), relativeTo: nil)
        }
    }
}
Here is a video of what it looks like when running; the resulting point always seems to end up in the same area of space: Video
The calculations are basically the ones from the sample code Displaying a Point Cloud Using Scene Depth.
Finally, here is the full zip file if you want to try it yourself: ZIP.
Any idea what is wrong in my calculations?
@oscar-falmer Yes, I wrote that answer on the Apple Developer forums and made that body tracking package. I tried linking to them here as well, but someone came along and deleted my answer because it was little more than links. Here is the solution, copied below.
The Vision result comes in Vision coordinates: normalized, (0,0) Bottom-Left, (1,1) Top-Right. AVFoundation coordinates are (0,0) Top-Left, (1,1) Bottom-Right. To convert from Vision coordinates to AVFoundation coordinates, you must flip the Y-axis like so:
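For example (a sketch; visionPoint here stands for the normalized location of the Vision point, e.g. indexTipPoint.location from the question):

// Vision: normalized, origin bottom-left. AVFoundation: normalized, origin top-left.
let avFoundationPoint = CGPoint(x: visionPoint.x, y: 1 - visionPoint.y)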
This AVFoundation coordinate is what needs to be used as input for indexing the depth buffer, like so:
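Something along these lines (frame is the current ARFrame, and value(column:row:) is the same CVPixelBuffer helper used in the question):

if let depthBuffer = frame.estimatedDepthData {
    // estimatedDepthData has the same orientation as capturedImage (AVFoundation space),
    // so the normalized AVFoundation point maps directly onto the buffer.
    let width = CVPixelBufferGetWidth(depthBuffer)
    let height = CVPixelBufferGetHeight(depthBuffer)
    let column = Int(avFoundationPoint.x * CGFloat(width))
    let row = Int(avFoundationPoint.y * CGFloat(height))
    if let depth: Float = depthBuffer.value(column: column, row: row) {
        // depth is the distance from the camera in meters
    }
}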
This is all that is needed to get depth for a given position from a Vision request.
If you would like to find the position on screen for use with something such as UIKit or ARView.ray(through:), further transformation is required. The Vision request was performed on arView.session.currentFrame.capturedImage. As the documentation for ARFrame.displayTransform(for:viewportSize:) explains, the image being rendered on screen is a cropped version of the frame that the camera captures, so a transformation is needed to go from AVFoundation coordinates to display (UIKit) coordinates.
Converting from AVFoundation coordinates to display (UIKit) coordinates:
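For example (a sketch, assuming a portrait interface orientation; frame and arView are the current ARFrame and the ARView):

let viewportSize = arView.bounds.size
// Maps normalized image (AVFoundation) coordinates into normalized viewport coordinates
let toDisplay = frame.displayTransform(for: .portrait, viewportSize: viewportSize)
let normalizedDisplayPoint = avFoundationPoint.applying(toDisplay)
// Scale up from normalized viewport coordinates to UIKit points
let displayPoint = CGPoint(x: normalizedDisplayPoint.x * viewportSize.width,
                           y: normalizedDisplayPoint.y * viewportSize.height)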
To go the opposite direction, from UIKit display coordinates to AVFoundation coordinates:
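For example, by inverting the same transform:

let viewportSize = arView.bounds.size
let fromDisplay = frame.displayTransform(for: .portrait, viewportSize: viewportSize).inverted()
// Normalize the UIKit point first, then map it back into AVFoundation space
let normalizedScreenPoint = CGPoint(x: screenPoint.x / viewportSize.width,
                                    y: screenPoint.y / viewportSize.height)
let avFoundationPoint = normalizedScreenPoint.applying(fromDisplay)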
To get a world-space coordinate from a UIKit screen coordinate and a corresponding depth value:
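One way to do this is with ARView.ray(through:) (a sketch; note that the estimated depth is measured along the camera's forward axis, so scaling the ray direction by it is an approximation that works best near the center of the view):

func worldPosition(screenPoint: CGPoint, depth: Float, in arView: ARView) -> SIMD3<Float>? {
    // ray(through:) returns a world-space origin and normalized direction through the screen point
    guard let ray = arView.ray(through: screenPoint) else { return nil }
    return ray.origin + ray.direction * depth
}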
To set the position of an entity in world space for a given point on screen:
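Putting the pieces together (resultAnchor is the entity from the question; displayPoint and depth come from the snippets above):

if let worldPos = worldPosition(screenPoint: displayPoint, depth: depth, in: arView) {
    // relativeTo: nil places the entity in world space
    resultAnchor.setPosition(worldPos, relativeTo: nil)
}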
And don't forget to set the proper frameSemantics on your ARConfiguration:
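For example, for the person-segmentation depth used above:

let configuration = ARWorldTrackingConfiguration()
// estimatedDepthData is only populated when this frame semantic is enabled
if ARWorldTrackingConfiguration.supportsFrameSemantics(.personSegmentationWithDepth) {
    configuration.frameSemantics.insert(.personSegmentationWithDepth)
}
arView.session.run(configuration)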