So today I stumbled across this IEEE spectrum post on Care-o-bot, mark 4. I hadn’t actually come across the previous iterations of Care-o-bots, but I must say, looking at the video of the first attempt, I doubt there is much to be missing out on. Sorry Fraunhofer…
There is a link to a rather fun little promotional video of the robot, and as I watched it, I certainly felt a light hearted joy. The music was upbeat and gently fun, the basic unfolding of the plot was touching, and there is great attention to detail in the animation of the robot. HRI lessons have been picked up well here.
The robot design is simple at the heart. Slick white paint, some LEDs to provide robot specific feedback channels, and two simple dots for eyes. There is a whole host of research from HRI that demonstrates that simple is very effective, when used properly.
I particularly liked the arm movements when the robot is rushing around to get the rose. It certainly helps us empathize with the robot, no questions there. However, there is a price to pay for this. Battery cell technology is certainly coming along, but for robots energy is still a very precious resource.
So, my question, and it’s a philosophical one, and depends on how much you buy into the H in HRI, and of course what function the robot serves. Those rushing around arm movements that clearly conveyed a robot rushing around to achieve a goal. And in turn, given that, you might infer that the goal has quite some importance. They probably required a fair bit of power. Where they value for watts? Was it worth expending all that energy for the robot to rush around? Did they really bring that something truly extra to the robot?
I don’t think that there is a right or a wrong answer, but I think that this does remind us that from a practical perspective (at least for now), energy is precious to robots, and we as robot designers must contemplate whether we are using it most effectively.
Our lab got a new companion today in the form of a SociBot, a hand made/assembled 3 DoF robot head with a Retro Projected Face (RPF) sitting on top of a static torso, produced by Engineered Arts Ltd in Cornwall, UK. Setting you back about £10,000, it’s an interesting piece of kit and could well be a milestone in the development toward very (convincing) facially expressive robots.
So, how do these things work? Well, it’s actually really simple. You take a projector, locate it in the back of the robot’s head and project a face onto a translucent material which has the profile of a face. This way you as the user can’t see into the head, but you can see the face. It’s cheap, it’s simple and it’s becoming more popular at the moment. Retro-projected faces are not exactly a new idea in the world of social robotics however (and they have an even longer history in the world of theater I’ve come to learn). There’re have been a few universities exploring them in Europe, with a notable early example coming from Fred Delaunay who did his PhD work at Plymouth a couple of years ago on his LightHead robot (Fred has since gone on to pursue a startup company, Syntheligence, in Paris to commercialise his efforts since it seems to have caught on quite a bit). That said, they do have many useful things to offer the world of robotics.
For example, the face has no moving parts, and thus actuator noise is less of an issue (which isn’t altogether true, projectors get hot and do have a fun whirring away). Faces can also be rendered in real time (thus noiseless eye saccades, blinking and even pupil dilations are possible), and are very flexible to alterations such as blushing. You can put any face you want on SociBot, with animation possibilities where the sky’s the limit. The folks who put together the software for the face system (amusingly called “InYaFace”) have also made a model of Paul Ekman’s influential Facial Action Coding system (FACs) which allows the robot to make all kinds of facial expressions, and allows it to tie in very well with research output that use the FACs system as a foundation.
The robot itself run’s Linux Ubuntu and uses lots of off the shelf software such as OpenNI for the Acus XTion Pro to do body tracking and gesture recognition, and face/age/emotion detection software using the head mounted camera as the feed source. TTS is also built in, as is Speech Recognition (I think), but there is no built in mic, only a mic input port. Also, we (by that I mean Tony) bought the SociBot Mini, but there is also a larger version where you can add a touch screen/kiosk as joint virtual space that both the robot and user can manipulate.
SociBot can be programmed using Python and has a growing API which is constantly under development and it looks promising. You also have the ability to program the robot via a web based GUI as Engineered Arts seem keen to have the robots be an “open” platform where you can log in remotely to control them.
Unboxing the robot
Given my recent post about the unveiling of Pepper by Aldebaran and the new Nao Evo that we got a little while ago, I was curious to see what Engineered Arts had put into the robot as a default behaviour. I must admit that I was rather pleasantly surprised. The robot appeared to be doing human tracking and seemed to look at new people when they came into view.
To make comparison with Aldebaran, I think that what really stands out here is that it took very little effort to start playing with the different aspects of the robot. When you get a Nao, it really doesn’t do that much when turned on for the first time, and there is quite a bit of setup required before you get to see some interesting stuff happen. Through the web interface however, we were quickly able to play with the different faces and expressions that SociBot has to offer, as well as the different little entertaining programs such as the robot singing a David Bowie number as well as blasting out “Singing in the Rain”. Lots of laughs were to be had very quickly and it was good fun exploring what you can do with the robot once it arrives.
From a product design / UX perspective, Engineered Arts have got this spot on. When you open the box of a robot, this is perhaps the most excited you will ever be about that robot, and making it easy to play with the thing and leaving a very good impression. Overall, however, some things are a bit rough around the edges, but the fact that this is a hand made robot tells you everything that you need to know about Engineered Arts: they really do like to build robots in house.
Drawbacks to retro-projected faces
Now that post is in part an overview of the current state-of-the-art in RPF technology, I can certainly see room for improvement as there are some notable shortcomings. Let’s start with some low hanging fruit, which comes in the form of volume used. As the face is projected, most of the head is actually empty space in order to avoid having yourself a little shadow puppet show. This (lacking) use of the space is actually inefficient in my view as it puts quite some limitations on where you can locate cameras. Both LightHead and SociBot have a camera mounted at the very top of the forehead. This means that you loose out on potential video quality and sensory. Robots like iCub and Romero have cameras that are mounted in actuated eyes which allows them to converge (which is interesting from a Developmental Robotics point of view), but also compensate for robot’s head and body movements, providing a more stable video feed. Perhaps this is a slightly minor point as the robot’s are generally static in the grand scheme of things, when they start to walk, I can see things changing quickly (in fact this is exactly why Romeo has cameras in the eyes).
Another problem is to do with luminosity and energy dispersion. As the “screen ratio” of the modern off the shelf projector isn’t really suitable to cover a full face at the distance required, these systems turn to specialised optics in order to increase the projection field of view. However, this comes at the cost of spreading the energy in the projection over a larger area which results in a lower overall luminosity over a given area. This is furthered even more as when the projection hits the translucent material, the energy refracts and scatters even more, which means you loose more luminosity, as well as the image loosing some sharpness. Of course the obvious solution is to put in a more powerful projector, but this has the drawback that it will likely get hotter, thus needing more fan cooling, and with that fan running in a hollow head, that sound echoes and reverberates.
Personally, I’m still waiting for the flexible electronic screens technology to develop as this will likely overcome most of these issues. If you can produce a screen that takes the shape of a face, you suddenly no longer need a projector at all. You gain back the space lost in head, loose the currently noisy fan that echos and luminosity becomes less of an issue. Marry this with touch screen technology and perhaps actuation mounted under the screen and I think that you have a very versatile piece of kit indeed!
Video Posted on Updated on
Ever since reading Cynthia Breazeal’s book, “Designing Sociable Robots“, I’ve had this constant itch to implement her visual attention model on a robot, mainly the Nao as there’re four of them laying around in the lab these days. So, suffice to say that I’ve finally gotten around to scratching this particular itch, and boy does it feel good! 🙂
So, if you haven’t already read this book (and if you work in social robotics, shame on you), I highly recommend it! It’s full of lots in interesting insights and thoughts, and it is a sure read for any new MSc/PhD students that might be embarking on their research journeys.
To get to the point, in one of the chapters, Breazeal describes the vision system running on Kismet. This is actually something that was developed by Brian Scassellati (whilst working on “Cog”, if I recall), and I must say, I think that it is a little gem (hence why I wanted to see it run on the Nao). The model is intended to make the robot attend to things that it can see in the environment (e.g. things that move, people, objects, colours, ect) using basic visual features. Basically a bottom-up approach to visual processing: take lots of basic, simple features, and combine “upwards” to something that is more complex.
I’ve finally implemented the model, from scratch and made it run using either a Desktop webcam, or using it with an Aldebaran Nao. This little personal project also holds a more serious utility. I’m now beginning to make an online portfolio of my coding skills as I have seen some employers request example code recently (and I’m currently on a job hunt). I’ve made two YouTube videos of the model. The first is it running on my Desktop machine in the lab, where I talk through the model and the parameters that drive it. In the second video I show the slightly adapted version running with a Nao. Here are those two videos:
I have to admit that there is certainly room for improvement and fine tuning in the parameter settings, as well as some nice extensions. For example I had a bit of trouble as there is quite a lot of red in our office and the robot was immediately drawn to this. Either I need to change the method for attention point selection, or I need to take distance into account in some way (but there isn’t and RGBD sensor on the Nao at the moment). Currently for attention point selection I am finding all the pixels that share the same max value in the Saliency Map and finding the Center of Mass of the largest connected region of these. Alas in the videos this was sometimes background items…
Talking about possible extensions, I certainly see alot of room to have an adaptive mechanism that provides the “Top Down” task orientated control of the feature weights (at least) as was done with Kismet. There are a small subset of the different parameters driving the model and finding values that work can be a little tricky. Furthermore, I suspect as soon as you change setting, you will need to tweak parameters again.
Coding this system up also made me think about the blog post I wrote a about what a robot should do out of the box. I recall that the Nao was doing at least face detection and tracking. I pondered the idea of whether this kind of model would work as on out of the box program. Rather than having fixed weights, the robot could have some pre-set modes (as Kistmet did) and just cycle through these at different intervals. Perhaps the biggest problem will be the onboard processing that would need to happen. My program is multi-threaded (each feature map is computed in it’s own thread, as is the Nao motor control) and isn’t exactly computationally cheap, and so I can see it using quite a bit of the processing resources.
Anyway, there are lots of possibilities with this model both with respect to tweaking it, extending it, and merging it with other “modules” that do other things. As such, I’ve made the code available to download:
Desktop + webcam version (needs Qt SDK, OpenCV libs and ArUco libs): Link
Version for the Nao (needs Qt SDK, OpenCV libs, ArUco libs and NaoQi C++ SDK, v 1.14.5 in my case): Link
Note: With the NaoQi SDK, this isn’t free. You need to be a Developer and I have access through the Research Projects at Plymouth University. I can’t provide you with the SDK as this would go against the agreement we have with Aldebaran… Sorry… 😦
Over the last few months the Plymouth Node of ALIZ-E has been working hard toward a rather large experiment that got underway at the start of the week. Last month we went to the Natural History Museum to showcase some of Plymouth University’s robotics research for Universities Week. That was used as a chance to give our system a hard beta test, which generally went very well. We identified some things that needed to be changed and made more robust, but all in all, we were very happy with the outcome.
After two weeks more development and ironing out creases in the code, we finally deployed our two systems to a local Plymouth Primary school on Monday morning this week and it will stay in two different classrooms for two weeks. You can see the setup below.
So, the system consists of a 27 inch all in one touch screen computer which essentially provides a shared input/malleable space for both the human and the robot. The Nao stands/kneels on one side of the screen, while the human is located on the other side, facing the robot. Behind the Nao is a Kinect sensor that is facing the human and can see just over the top of the Nao’s head.
Basically, on the touchscreen we run some programs that allow the children to play a non-competitive game with the robot. This is a categorisation task, where icons on screen (e.g. numbers, planets, colours) have to be moved into a particular location on the screen in order to be categorized in some way (e.g. odd or even number, numbers that are/not part of the 4 timetable, planets with moons and without moons). By keeping the basic underlying task simple, it is easy to change the icons and the categories to produce a wide repertoire of different games that are easily accessible for young children. As part of this game experience/setup, each child in the class has their own “saved account” (which they select by selecting their name on screen when prompted by the robot) which allows the robot to track their progress through the various games, and in turn pitch the level of difficultly accordingly. The robot also takes part in this game by also, moving icons onscreen, and as it is very open ended, turn taking tends to emerge from the process naturally as a result.
The rational for using a touchscreen is that is provides an interactive medium which is simultaneously intuitive for the human and the robot. In essence, the human is able to manipulate objects on screen in both a natural manner (say select them, drag and drop, ect) as well as being able to use similar cues as may be used with tables and such devices (e.g. swipe, pinch to zoom, ect). On the part of the robot, the touchscreen is a computer, makes communication between it and the robot very simple. Further more, when you present something on screen, the robot is able to gain lots of knowledge about objects that are being manipulated (and how) that may otherwise have been very difficult to obtain in the physical world. By providing a shared, malleable space that is virtual rather than physical, you overcome many of the limitations in both the Nao’s ability to manipulate real objects, as well as its perception limitations.
We use a Kinect for a similar reason. The Kinect is used to estimate head pose, and thus determine whether the human is looking at the screen, at the human, or somewhere else. This allows us to, in part, drive the gaze behaviour of the robot. For example, if the human is looking at the screen, then the robot can establish joint attention by doing the same, or if the human is looking at the robot then the robot knows that it can make eye contact. It is useful to have the Kinect act as a “fly in the corner” because some of these issues are difficult to achieve using the built in cameras on the Nao, particularly as they move as the head gazes around.
We’re keeping our fingers crossed that we aren’t bitten too hard by the “Demo Effect” (which means that as soon as it comes to actually showing a system working to real people, a fully working system in the lab decides to have a catastrophic break down…). Doubtless that we will seek to publish our results from the experiment in due course (so sorry for not detailing the experiment itself here), so watch this space… 😉
I recently attended (and presented at) the HRI’14 conference in Bielefeld which exhibited lots of the latest and greatest developments in the field of HRI from across the world. HRI is a highly selective conference (23% acceptance rate or so), and while getting a full paper in the conference seems to have some more emphasis and weight in the US than in Europe, it’s always a good venue to meet new people and talk about the interesting research that is going on.
It turns out that this year there was a New Scientist reporter, Paul Marks, attending and keeping a close eye on the presentations and he’s written a nice article about some of the little tricks that we roboticists can use to make our robots that little bit more life-like. He draws upon some of the work that was presented by Sean Andrist, Ajung Moon, Anca Dragan on robot gaze, and also the work that I published/presented on NLUs, where I found that peoples’ interpretations of NLUs is heavily guided by the context in which they are used.
Essentially what I found was that if a robot uses the same NLU in a variety of different situations, people use the cues from within the context to help direct their interpretation of what the NLU actually meant. Moreover, if the robot were to use a variety of different NLUs in a single given context, people interpret these different NLUs in a similar way. To put it plainly, it would seem that people are less sensitive to the properties of the NLUs themselves, and “build a picture” of what an NLU means based upon how it has been used. This has a useful “cheap and dirty” implication for the robot designer: if you know when the robot needs to make a utterance/NLU, you don’t necessarily have to make an NLU that is specifically tailored to the specific context. You might well be able to get away with making any old NLU, being safe in the knowledge that the human is more likely to utilise cues from the situation to guide their interpretation, rather than specific acoustic cues inherent to the NLU. This is of course not a universal truth, but I propose that this is a reasonable rule of thumb to use during basic HRI… However, I might change my view on this in the future with further experimentation… 😉
Anyway, it’s an interesting little read and well worth the 5 minutes if you have them spare…