I think you’re overcomplicating it.
This is more like what we really need.
We’d use it differently, and not so much to animate Poppy. We go through the setup checklist, and then Poppy can recognize our facial expressions. That’s a form of machine vision we can use and modify to our liking. We could throw a coke can in there, label it, and then Poppy could find it. It’s not a facial expression to recognize, but it is an object. We might use CAD to draw a 3D model of the object we want the robot to recognize, just to generate a proper point cloud.
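To make that concrete, here’s a deliberately crude sketch of matching an observed cluster of points against a reference point cloud (the kind you might export from a CAD model of the can). Every name and number here is invented for illustration; a real system would use proper point-cloud registration, not just bounding boxes.

```python
# Hypothetical sketch: compare an observed point cluster against a
# reference point cloud (e.g. exported from a CAD model of a coke can).
# Function names, sizes, and the tolerance are all assumptions.

def extents(points):
    """Axis-aligned bounding-box size (dx, dy, dz) of a point cloud."""
    xs, ys, zs = zip(*points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))

def looks_like(reference, observed, tolerance=0.125):
    """Crude shape test: do the bounding boxes agree within the
    tolerance (1/8 inch here, matching the resolution guessed below)?"""
    return all(abs(r - o) <= tolerance
               for r, o in zip(extents(reference), extents(observed)))

# A toy "coke can" reference: roughly 2.6 x 2.6 x 4.8 inches.
can = [(0.0, 0.0, 0.0), (2.6, 2.6, 4.8)]
seen = [(5.0, 1.0, 0.0), (7.55, 3.62, 4.85)]  # cluster found in the room
print(looks_like(can, seen))  # True: sizes agree within 1/8 inch
```

The point is just that once the CAD model gives you a labeled reference cloud, “find the can” reduces to comparing shapes in the robot’s own 3D map.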
It’s just that two cameras together can resolve a more accurate point cloud, which gives the robot more room to avoid obstacles and carry out object-oriented missions. Cleaning a toilet isn’t easy. Think about all of the cleaners, tools, etc. you’d need to clean it yourself, and then work through it step by step with just the objects and targets required.
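The reason two cameras resolve depth at all is simple triangulation: for a rectified stereo pair, depth is focal length times baseline divided by disparity. The camera numbers below are made up for illustration, not taken from any real rig.

```python
# Stereo triangulation sketch: depth = focal_length * baseline / disparity.
# The 700 px focal length and 6 cm baseline are assumed example values.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Distance to a point seen by both cameras of a rectified pair."""
    if disparity_px <= 0:
        raise ValueError("point must shift between the two images")
    return focal_px * baseline_m / disparity_px

# 700-pixel focal length, cameras 6 cm apart, feature shifted 35 px:
print(depth_from_disparity(700, 0.06, 35))  # 1.2 (meters)
```

Nearby points shift a lot between the two images, distant points barely shift, and that difference is exactly what fills in the third coordinate of each dot in the point cloud.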
Bottom line: in electronics, say we have a missing capacitor but know what frequency the circuit is tuned to. There’s a way of working the equations backward to find that value. The same thing is true of all the math a robot’s visual system has to do, just in reverse. So instead of projecting two images onto a screen, edge detection gives a fair alignment for beginning a point cloud.

The point cloud is just a 3D gaming engine being used differently. Instead of vertices, all you have is dots in 3D space; then you have the computer start drawing the vertices between them. Units of measure are known within the 3D space. It’s full of approximations, but they’re probably within 1/8 of an inch if you look at the resolution. It would probably only need to be as good as Quake or Quake II.

But instead of running a game, the AI is plotting a path through obstacles. And instead of the robot making all kinds of mistakes: if it’s been in the room, and all around in it, then it already has a fairly complete 3D map, and the only things it would notice are changes since the last time it was there. It wouldn’t even ask any questions; it would just update the map. A gaming AI is enough for navigation once you have a 3D map of a room. Most of the software already exists; it’s just that the pieces we need aren’t all of the pieces that they need.
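The missing-capacitor analogy works out like this: the resonance formula f = 1 / (2π√(LC)) can be rearranged to solve for C once f and L are known. The 1 MHz / 100 µH values below are just example numbers.

```python
import math

# Solving the resonance formula f = 1 / (2*pi*sqrt(L*C)) backward for
# the unknown capacitor: C = 1 / ((2*pi*f)^2 * L).

def missing_capacitor(f_hz, l_henry):
    """Capacitance that tunes an inductor l_henry to frequency f_hz."""
    return 1.0 / ((2 * math.pi * f_hz) ** 2 * l_henry)

# A circuit tuned to 1 MHz with a 100 microhenry coil:
c = missing_capacitor(1e6, 100e-6)
print(c)  # about 2.53e-10 farads, i.e. roughly 253 pF
```

Vision runs the same trick in reverse: the geometry equations that would normally project a 3D scene onto two 2D images are worked backward, from two 2D images to the 3D dots.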
For now, the best example is ASIMO’s greeting/dismissal, where it just bows politely. It could stand too close or too far away, and even though we feel it’s just being polite, we’re probably lucky not to bump heads.
But one of the first bugs this software would have presented is already a solution. The problem it would have had, just in its very beginnings as a program, is really the solution we need: the first time it 3D-mapped properly, it would have mapped the whole image. And there’s the room we need for navigation.
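And once the whole room is mapped, the “gaming AI is enough” claim is easy to sketch: flatten a slice of the 3D map into an occupancy grid and run an ordinary game-style search over it. The grid, coordinates, and function name below are invented for illustration; real planners would use A* over the full 3D map.

```python
from collections import deque

# Gaming-style pathfinding sketch: given a 2D occupancy grid (a slice of
# the room's 3D map, '.' = free, '#' = obstacle), plain breadth-first
# search finds a shortest route around the obstacles.

def find_path(grid, start, goal):
    """Return a list of (row, col) cells from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == '.' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None  # no route exists

room = ["....",
        ".##.",
        ".#..",
        "...."]
print(find_path(room, (0, 0), (2, 3)))
```

If the furniture moves, only the changed cells get updated, and the same search keeps working on the new map, which is exactly the “just update the map” behavior described above.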