(c) Copyright 1993.
At the opposite end of the interaction spectrum is the field of autonomous robotics. For applications in manufacturing, for example, where task procedures are repetitive and task environments are structured and well known, advances have been significant. For applications where non-repetitive, flexible manipulation in unstructured environments is required, however, artificial intelligence (AI) and autonomous robotics are still in relatively early development stages [14]. In general, a large amount of prior domain knowledge and detailed task procedures have to be programmed into today's autonomous robots in order for a task to be performed [e.g. 21]. It is widely believed, in other words, that autonomous robots which can effectively perform practical 'intelligent' manipulations will not be realisable within the near future.
One of the cornerstones of function allocation in human factors engineering is recognition of the fact that, whereas the need for continual manual control is under most work circumstances to be minimised, the opposite extreme of complete autonomy also carries with it a number of potential problems [e.g. 23]. What one should typically strive for, rather, is to combine what humans are good at with what machines are good at. Heuristic knowledge, creativity, dexterity and 'common sense' are all attributes that are possessed by humans but not (typically) by machines. On the other hand, rapid computation, mechanical power and perseverance are advantages of machines not possessed by humans. As technology advances, the areas in which machines are good are expanding, forcing designers of human-machine systems to take these trends into account. In the context of robotic manipulation, technology is bringing improvements to both low level perception in machine vision and precise sensor-based robotic control. Machines (computers) are nevertheless still very poor at high level perceptual and cognitive functions, such as object recognition, situation assessment, decision making and strategic task planning. Clearly, if both human cognitive and perceptual capabilities and robot skills can be employed where necessary, the performance of synergetic human-robot systems will eventually surpass that of either autonomous robots or traditional teleoperators, provided that ergonomic guidelines are employed to facilitate such designs.
Our objective in this article is to describe work underway in our laboratory to advance the concept of the Director/Agent (D/A) mode of supervisory telerobotic control. What we are aiming to achieve is a capability for executing 'intelligent manipulation', especially for environments such as those found in space, underseas, in nuclear facilities, and for various types of surgery. In such domains uncertainties associated with the task environment and the particular task requirements necessitate that suitable cognitive, perceptual and motor skills be available, for perceiving and understanding the environment, for identifying the problems to be solved, for planning and selecting task procedures, and for executing them with strength, precision and dexterity. In the paper we discuss the philosophy underlying the D/A metaphor, describe our efforts towards developing a prototype -- in particular by means of our augmented reality system, ARGOS [6] -- and briefly give an overview of some of our research results.
A more recent goal in the development of telerobotics is telepresence, an extension of the operator replication metaphor that has design objectives analogous to those of master/slave systems. It typically employs such immersive virtual reality techniques as head mounted displays, sophisticated tracking sensors mounted on the operator's head and limbs, and (if possible) force feedback, to control an anthropomorphic multiple degree-of-freedom slave robot. Telepresence systems attempt to provide a 'transparent' man-machine interface, thereby transmitting human problem solving and manipulative skills into remote (hostile) environments [15]. The ultimate goal of such systems is to allow the human operator to feel herself 'present' at the remote site and, in a sense, 'be' the anthropomorphic robot [8]. One of the most technologically advanced telepresence systems [19] has already demonstrated significant advantages over traditional teleoperation in a number of tasks.
As mentioned earlier, the major disadvantage of operator replication is that it may be fatiguing for the human operator to maintain continual manual control of the teleoperation task. The second disadvantage is that this mode emphasises transferring human capabilities to the work environment, but does not stress the fact that robots / computational machines have desirable capabilities that humans do not possess. One such quality is patience, for example: in order for a very slowly moving or repetitive task to be carried out in operator replication mode, the human controller would have to persevere in the loop during the entire operation, a circumstance for which some degree of autonomy would be ideal.
In addressing some of these problems, our goal has been to develop a Director/Agent (D/A) mode of control, in which the sensor-based robot does not simply replicate the human operator's movements, as in the operator replication metaphor. Instead, rather than projecting herself actively into the work site, the human operator acts as a 'director', while the robotic system, as the human operator's 'agent', provides such advantages of machine (computer) systems as computation, precision and sensing capabilities. In a sense, the human director can be regarded as being adjacent to, rather than coincident with, the robot at the remote site.
In order for it to carry out its tasks, the robotic agent must receive precise information from the human director. Whereas humans deal well with general concepts and low precision spatial estimates, machines deal best with accurate computations and explicit instructions. To bridge the gap between the director and the agent in a manner that allows easy and efficient communication, the human-robot interface must be designed in an ecologically sound way, such that the human operator is able directly to perceive, identify and locate relevant objects in the environment and issue instructions (e.g. "go there", or "pick up that") that can be easily understood and executed by the robot. Using a telepresence approach to achieve this, for example, would entail providing the means for enacting the required actions as if the operator were there performing the task. With our D/A approach, however, we are striving to achieve what we refer to as virtual control, whereby the operator communicates commands by means of a virtual pointing device, or even a virtual manipulator, while the real manipulator executes the specified action following a prescribed time delay [25]. In addition to the director's being given the opportunity for planning and rehearsing the desired command, she is also relieved of the need for remaining actively involved during its execution. Another direct advantage of the virtual control approach, furthermore, is that the need for supplying force/torque feedback to the operator may be obviated; instead, force information which is limited distally to local robot control systems may be sufficient for achieving even better performance.
To support D/A interaction, the means of communicating between the human director and the robot agent becomes a critical issue. In general, two types of 'languages' can be used in human robot communication: one is command-based and the other is graphic. In the following section we review these communication formats. The remainder of the paper presents the augmented reality tools that we are developing to support D/A communication.
We advocate a more general terminology, 'continuous' vs 'discrete' language, to describe the dichotomy in communication formats. Continuous language is a means of representing information which is distributed continuously, along either a spatial or a temporal dimension. It is therefore in many ways a superset of Sheridan's manual language. In human-machine communication, analogue displays and such input devices as mice and joysticks exemplify continuous formats. In contrast, discrete language consists of independent elements, and is in general a superset of verbal language. In most cases, written text, oral commands and computer programming instructions, as well as discrete switches, keys, buttons and other such interface tools, all fall under the category of discrete communication media.
Because robotic manipulations are inherently spatial and continuous, they would thus appear to be amenable to continuous, or manual, languages. Requiring an operator to translate spatial and continuous goals into a discrete format that the robot can understand is an unnecessary burden and would constitute a poorly designed interface. Nevertheless, discrete (command) languages can be used effectively in human robot communication, but only when the information bandwidth required for the particular task is sufficiently low. In other words, only when the robot has sufficient autonomy, or when the task scenario is very simple (few degrees of freedom), can discrete command languages transfer human operator instructions adequately; otherwise, communication has to be carried out in a continuous format.
Obviously, human robot communication can and should ideally be carried out in multiple complementary modes -- and the human robot interface should be ergonomically designed so that human operators are able to switch between modes easily, and whenever necessary.
The major portion of our work on remote site viewing has centred on applying stereoscopic displays. Previous work in the ETC-Lab has, for example, examined the use of stereoscopic video (SV) displays as an ecologically sound means of improving the user interface [4, 5]. By presenting depth information in a direct way to the user, the system reduces some of the complexity of the task. Whereas monoscopic video images often require the operator to interpret shadows and reflections in order to infer a sense of the spatial relations in the remote scene, stereoscopic video images present that same information in a way that is immediately accessible to the operator, with much less mental processing. This can reduce training times and improve task performance (i.e. faster and/or more accurate execution).
Stereoscopic displays therefore can clearly improve communication from the machine to the human operator, by presenting information in an appropriate format. Similar improvements can be found in the flow of information from the human to the machine, given that a technique can be found that is sufficiently natural for the human and precise enough for the machine. The major portion of our work towards this end has centred on the concept of employing computer generated (virtual) stereoscopic graphics as a means of interactively probing (real) remote worksites viewed stereoscopically by video, and subsequently using quantitative data thus obtained for communicating spatial command information for telerobotic control. A number of prototypes of this concept, which we refer to as ARGOS, for Augmented Reality through Graphic Overlays on Stereovideo [6], have been developed and examined experimentally, and are reviewed in the following sections.
An overall block diagram of the ARGOS system is given in Figure 1 [25]. In this system, a pair of matched video cameras is used to provide two images of the work site from slightly disparate viewpoints. In order to feed the two video images to the computer using one frame grabber, an electronic mixing circuit is used to combine the video channels into a single channel of interlaced NTSC frames, where one field is devoted to the left view and the other to the right. In our ARGOS-I system we use an Amiga 2500 computer equipped with a Genlock 2300, to generate 30 Hz alternating field stereoscopic images [12]. A MicroSpeed FastTrap 3D Model 8735 Trackball is used as the interactive pointer positioning device.
For ARGOS-II a Silicon Graphics Iris 4D/310 GTX 3D colour graphics workstation is used as the display and control computer. The frame grabber is an SGI Live Video Digitiser (LVD), which can grab and digitise 30 video frames per second. Each combined digital image is separated in the frame buffer of the Iris into left and right images, which are presented alternately, at 120 Hz, on the Iris monitor, producing a flicker-free field sequential stereo image [10]. The viewer's liquid crystal spectacles are synchronised with the Iris monitor by means of an electronic sync separation circuit. A Spaceball(TM) and a customised Ascension Bird(TM), both having six degrees of freedom, are used as input devices. (An active research programme is currently underway in the ETC-Lab to evaluate systematically a variety of six degree-of-freedom manipulation schemes [26, 27].) A CRS Plus M1A industrial robot is currently serving as the robotic manipulator.
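The separation of a combined frame into left and right views can be illustrated with a minimal sketch. It assumes, purely for illustration, that a digitised frame is held as a list of scan lines and that even-numbered lines belong to the left-eye field and odd-numbered lines to the right-eye field; the actual field assignment and buffer layout in ARGOS are not specified here, and the function name `split_fields` is hypothetical.

```python
def split_fields(frame):
    """Split a field-interlaced frame into (left, right) images.

    `frame` is a sequence of scan lines. ASSUMPTION: even-indexed lines
    carry the left view and odd-indexed lines the right view.
    """
    left = list(frame[0::2])   # lines 0, 2, 4, ...
    right = list(frame[1::2])  # lines 1, 3, 5, ...
    return left, right


if __name__ == "__main__":
    # Each string stands in for one scan line of pixel data.
    combined = ["L0", "R0", "L1", "R1", "L2", "R2"]
    left, right = split_fields(combined)
    print(left)   # ['L0', 'L1', 'L2']
    print(right)  # ['R0', 'R1', 'R2']
```

Each separated image has half the vertical resolution of the combined frame, which is the usual cost of carrying two views in one interlaced channel.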

a) the robot is sufficiently intelligent to be able to survey the environment and perform the steps necessary to identify, recognise and locate the object, such that a straightforward "find and go to object Q" command is sufficient;
b) the robot already knows where (but not necessarily what) the object is, such that a straightforward "go to object Q" command is sufficient;
c) the robot does not know where or what the object is, but can be told exactly where to go, by means of a "go to {x,y,z}" type of command;
d) the robot is guided to the object by means of continuous manual control.
As discussed earlier in this paper, our particular objectives do not address the kind of modelled environment which would be necessary for option (b), nor do we assume the availability of the level of autonomy that would allow option (a) to be feasible in the general case. Furthermore, as stated before, we are striving to overcome the necessity for reverting to option (d). Assuming availability of the basic robot control capability which is necessary for the remaining option (e.g. inertial navigation), we are striving to apply the overlaid stereographics capability of our ARGOS system to make (c) a feasible option.
As a result of the compelling sense of depth provided by stereoscopic displays, the immediate advantage gained is in one's ability to estimate the relative distances or separations between objects along the depth axis. Absolute judgement of depth, sizes and distances remains a problem, however. For example, whereas it might be straightforward for an observer to determine that object Q is farther away from the cameras than object R, it is generally more difficult to estimate how far away the two objects are from the cameras, or from each other. The reason for this is that, no matter how well designed the stereovideo system is, it is essentially impossible for the display to generate an image which is identical with a real directly viewed scene. That is, there are almost always some differences in magnification, field of view and binocular disparity between the video display and natural direct binocular viewing. It is important to note, furthermore, that even under natural direct viewing conditions humans have difficulty in estimating absolute quantities, such as sizes and distances [22]. This is especially true of environments in which objects are unfamiliar, because of the diminished effect of size constancy cues.
Therefore, although stereoscopic video (SV) is a powerful tool for such relative distance judgement related functions as detecting objects and obstacles, keeping vehicles on track, precise proximity operations, etc., SV alone is not necessarily sufficient for direct communication of control information from the operator to the robot, nor for functions such as path planning, estimation of clearances, rangefinding, etc. (As an example, rather than going to the effort of driving a remotely controlled vehicle up to a particular passageway or opening, and subjecting it to whatever hazards or inconveniences are present along the way, it would be much more convenient if the operator could estimate accurately beforehand, from a distance, whether the vehicle will be able to pass through the opening when it gets there.)
In addressing this requirement, one of the essential components of ARGOS is a calibrated three dimensional (3D) stereographic (SG) pointer, which can be moved about by the operator within the 3D video image of the remote world, as a sort of 'virtual probe' that can be aligned with any object whose location coordinates are needed. In other words, the operator of such a system would be required not to make direct position estimates, but instead merely to adjust the 3D pointer until it is aligned as accurately as possible with a selected feature of the required target object. It is then left to the software, not to the human, to make the necessary location calculations and direct the manipulator or vehicle there. The principle underlying our design, therefore, is to convert the (difficult) task of making absolute judgements into a (simpler) relative judgement task.
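The relationship between a pointer's 3D location and its left and right screen images can be sketched as follows. This is not the ARGOS calibration procedure; it assumes an idealised pair of parallel pinhole cameras with a common focal length `focal` (in pixels) and separation `baseline`, with the origin midway between the cameras. All function and parameter names are hypothetical.

```python
def project_pointer(point, focal, baseline):
    """Project a 3D pointer position into left and right image coordinates.

    ASSUMPTION: parallel pinhole cameras at x = -baseline/2 and
    x = +baseline/2, both looking along +Z.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("pointer must lie in front of the cameras")
    x_left = focal * (x + baseline / 2.0) / z
    x_right = focal * (x - baseline / 2.0) / z
    y_img = focal * y / z
    return (x_left, y_img), (x_right, y_img)


def locate_pointer(left_px, right_px, focal, baseline):
    """Recover the 3D position from the two image positions (the inverse)."""
    disparity = left_px[0] - right_px[0]
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front")
    z = focal * baseline / disparity
    x = z * (left_px[0] + right_px[0]) / (2.0 * focal)
    y = z * left_px[1] / focal
    return (x, y, z)


if __name__ == "__main__":
    # The operator aligns the pointer with a target feature; the software,
    # not the operator, then computes the "go to {x,y,z}" coordinates.
    left_px, right_px = project_pointer((0.1, -0.05, 1.5), focal=800.0, baseline=0.1)
    print(locate_pointer(left_px, right_px, focal=800.0, baseline=0.1))
```

Under this model the operator only judges whether pointer and target coincide in the stereo image, while the absolute coordinates follow from the camera geometry.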

Another important feature of this system is the virtual tape measure option, illustrated in Figure 2 by a solid line joining the top corner of the monitor to the cursor. Just like a real-world tape measure, this can be 'dragged' through space to join, and measure the distance between, any starting point and any other point indicated by the interactive cursor, as illustrated in the figure by the graphic insert. Continuous readouts of cursor positions, as well as tape measure readings, are also possible.
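Once the start and end points of the tape measure are known in calibrated 3D coordinates, the reading itself reduces to a Euclidean distance. A minimal sketch, in which the function name and coordinate units are illustrative assumptions:

```python
import math


def tape_measure(start, end):
    """Return the straight-line distance between two 3D points."""
    return math.sqrt(sum((e - s) ** 2 for s, e in zip(start, end)))


if __name__ == "__main__":
    anchor = (0.0, 0.0, 0.0)    # e.g. a fixed starting point in the scene
    cursor = (3.0, 4.0, 12.0)   # current position of the interactive cursor
    print(tape_measure(anchor, cursor))  # 13.0
```

Re-evaluating this function whenever the cursor moves would give the kind of continuous readout described above.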
An obviously essential issue associated with the SG pointer capability is whether or not operators are indeed capable of accurately aligning virtual SG images with their intended SV target images. A psychophysical experiment addressing this issue was carried out to examine the ability of subjects to place a SG pointer at specific points in the 3D SV world. In that experiment, subjects were required to perform a distance matching task using a balanced set of virtual and real pointers and virtual and real targets. The objective was to see whether they would perform just as well (or better) with "virtual tools" as with real ones. The results of that experiment, reported in [7], indicated that subjects are indeed able to align virtual pointers with real targets essentially as well as they are able to align real pointers with real objects.

a) The pathway can disappear while the manipulator moves to the indicated target position;
b) The pathway can be allowed to remain on the screen, as originally drawn;
c) The pathway can be made to remain attached to both the target location and the end effector during task execution.
Both options (b) and (c) present potentially powerful possibilities as 'flight guidance' displays (in aviation parlance) for enhancing virtual control during D/A interaction. Option (b) could be considered the analogue of a 'flight director' display, since it indicates (what might have been) the optimal path (i.e. a straight line) at commencement of the manoeuvre and subsequently, on a continuous basis, how much the manipulator is deviating from it during execution. Option (c) on the other hand can be considered a 'virtual tether', similar to that proposed by de Hoff and Hildebrandt [9]. According to this concept, the virtual tether would provide a continually enhanced display of information about the position and orientation of the manipulator relative to the target. Both cases (b) and (c) assume, of course, some level of involvement of the human director during task execution, if only for monitoring purposes, rather than the complete detachment that is theoretically possible in D/A control.
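The deviation that a 'flight director' style display of option (b) would convey can be sketched as the distance from the current end effector position to the originally drawn straight-line path. This is an illustrative computation under the assumption that the path is a line segment from start to target in calibrated 3D coordinates; the names below are hypothetical.

```python
import math


def deviation_from_path(start, target, position):
    """Return (deviation, progress) of `position` relative to the path.

    deviation: distance from `position` to the nearest point on the
               segment start -> target.
    progress:  fraction of the path covered at that nearest point (0..1).
    """
    path = [t - s for s, t in zip(start, target)]
    offset = [p - s for s, p in zip(start, position)]
    length_sq = sum(c * c for c in path)
    if length_sq == 0.0:
        # Degenerate path: start and target coincide.
        return math.sqrt(sum(c * c for c in offset)), 0.0
    progress = sum(a * b for a, b in zip(offset, path)) / length_sq
    progress = max(0.0, min(1.0, progress))  # clamp to the segment
    nearest = [s + progress * c for s, c in zip(start, path)]
    deviation = math.sqrt(sum((p - n) ** 2 for p, n in zip(position, nearest)))
    return deviation, progress


if __name__ == "__main__":
    print(deviation_from_path((0, 0, 0), (10, 0, 0), (5, 3, 4)))  # (5.0, 0.5)
```

For the 'virtual tether' of option (c), the analogous quantity would simply be the vector from the current end effector position to the target, recomputed as the manipulator moves.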
We have implemented and investigated the potential efficacy of the virtual tether concept, within the context of the laboratory peg-in-hole (PiH) setup shown in Figure 3. In that experiment the ability of subjects to integrate calibrated stereographic and stereovideo displays in a continuous manual PiH task was investigated. The experiment compared performance with real and virtual tether enhancements relative to a baseline condition with no tether. The most significant result of that experiment, reported fully in [16], is that the mean number of errors per trial with the virtual tether decreased dramatically (by 39%) relative to the no tether condition. A graphic summary of the data from that experiment, reproduced here in Figure 4, shows in an integrated fashion how scattering of the PiH insertion data decreases with introduction of each of the two types of tether.

One promising display capability is that of 'virtual landmarks', which can be superimposed onto a remote scene to provide calibrated perceptual anchors, for the purpose of enhancing the operator's depth scaling capabilities. In research on this concept [13], subjects were required to judge separations in depth between pairs of objects for which binocular disparity was essentially the only cue provided. In comparison with their performance with no other aids, subjects were able to make very accurate judgements of the depth separations between the target objects when provided with comparison landmarks of known separation, either real or virtual. They were able to accomplish this in fact without any of the error correcting feedback (adaptation) that was required to perform the task without landmarks. They were also able to maintain accurate performance even when camera separations were surreptitiously changed during the experiment, in spite of the perceptual bias that such changes were shown to introduce otherwise.
The final application of ARGOS which we mention here is based on the ability to superimpose a stereographic image of a real 3D object onto an existing video image of the same object. Clearly, to accomplish this we must assume that we have available a model of the remote site, comprising at the least a physical model of the object and information about its position and orientation relative to the stereo cameras. Such a capability is especially useful for complex telerobotic tasks in relatively structured environments such as space, where precise performance is especially important. One of the functions of the human operator in such environments is to monitor system integrity. An objective of the ARGOS system, therefore, is to facilitate this by providing a visual means for easily comparing the output of the object tracking system with the actual object in space [11]. A further objective is to provide the ability to modify the representation of real-world objects, which would otherwise be impossible with video alone. In structured environments, this should permit such capabilities as enhancement of edges, superposition of more complex flight director displays, and proximity warning information.