3D Audio Conferencing
Initial Experiment Outline

NYNEX Science & Technology, Inc.
Human Systems Interface Technology
John C. Checco, Michele Olson, Debbie Lawrence

Draft: June 28, 1996


This paper outlines an initial experiment integrating 3D audio cues into conference calls. This will provide the first real user interface study of a telecommunications product with this technology.


Does the addition of 3D audio cues enhance conference calls?
(PC: Is this a viable area for NYNEX to pursue?)

Affecting Factors:

The dynamics of conversation as well as teleconferencing present many factors which cannot be controlled. In an effort to create a well-designed experiment, the following affecting factors have been identified. Each of these factors can be varied or controlled, and explanations are given within each. The purpose of documenting factors which are not varied for this experiment is to ensure that these factors will remain constant (not fall through the cracks), and provide a direction for future study.

Bandwidth (Cue Limitation/Clipping):

44kHz, 22kHz, pan delay, no spatialization.

Much research has gone into how humans localize sound, resulting in highly detailed definitions of auditory cues. This experiment, however, is designed to find the effect of bandwidth (a telecommunication concern) on auditory spatialization. The 3D audio signal will therefore be limited to specific bandwidths, which will probably degrade a range of cues rather than remove any single specific cue. The only exception to this rule is the use of pan delay to mimic low-end audio spatialization.
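As a rough illustration of the pan delay condition, the sketch below places a mono signal to one side by delaying and attenuating the channel for the ear farther from the source. It is not a description of the equipment planned for the experiment: the function name pan_delay, the 0.6 ms maximum delay, and the linear gain law are all assumptions made for illustration.

```python
def pan_delay(mono, sample_rate, pan, max_delay_s=0.0006):
    """Return (left, right) sample lists for a mono signal.

    pan ranges from -1.0 (full left) to +1.0 (full right).
    The ear farther from the source receives a delayed, quieter copy.
    max_delay_s is an assumed maximum inter-channel delay, not a value
    taken from the experiment outline.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be in [-1.0, 1.0]")
    delay = int(round(abs(pan) * max_delay_s * sample_rate))  # in samples
    far_gain = 1.0 - 0.5 * abs(pan)  # assumed linear attenuation, far ear
    near = list(mono) + [0.0] * delay          # padded to equal length
    far = [0.0] * delay + [s * far_gain for s in mono]
    # Source on the right: right ear is near, left ear is far.
    return (far, near) if pan > 0 else (near, far)
```

Because the delay is rounded to whole samples, a lower sample rate gives a coarser set of possible delays, which is one way reduced bandwidth could interact with this cue.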

Experiment Use: VARY
Due to delays in the delivery of pan delay technology, the experiment focuses on 44 kHz, 22 kHz, and no spatialization. Once pan delay technology becomes available, those experiment cells will be executed.

Navigation Ability:

Head tracking, manual navigation (joystick), no navigation ability.

The ability to navigate can also provide a measure of the awareness a subject may have of 3D audio. With head tracking, a subject may unknowingly start to move their head toward sources -- much as people use hand gestures while on the phone. Explicit navigation mechanisms allow subjects to explore more advanced positioning and placement, which may be used as a method for conveying gestures -- much as users of text-based systems use :-). Bill Chapin of CRE has informally discussed the reaction of subjects when demonstrators position themselves within the subject's personal zone.

Experiment Use: VARY
Due to delays in the delivery of manual navigation technology, the experiment focuses on head tracking and no navigation. Once manual navigation technology becomes available, those experiment cells will be executed.

Voice Attributes:

Distinguished, indistinguishable.

Sound attributes (depth, pitch, inflection, etc.) are used extensively to distinguish among participants in existing conferences. The recognizability of voices is also an important factor. Although many factors are involved in deciding whether a voice is distinguishable, this experiment focuses on two particular aspects: gender differences and recognizability.

Experiment Use: CONSTANT
While we do not want to skew results by adding sound "helpers" into the experiment, the use of highly ambiguous voices may skew results in the opposite direction. Therefore, subjects shall have no prior knowledge of any of the participants' voices. However, given a conference size of 5 people where one person is the subject, the remaining four people will be a combination of 2 females and 2 males, specifically distributed across 5 patterns in the following manner:

P-x = Position x
S-x = Subject for session x
M = Male Actor
F = Female Actor

P-1  P-2  P-3  P-4  P-5
M    F    S-1  F    M
F    M    S-2  M    F
M    F    S-3  M    F
F    M    S-4  F    M
M    M    S-5  F    F

The choreography of the conference call shall remain constant to position (not participant), such that the participant at position P-1 shall have the same role, regardless of whether the actor is male or female.
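The seating patterns above can be written down as data and checked mechanically. The sketch below is illustrative only: the names PATTERNS and is_balanced are hypothetical, and the rows follow the table as read here, with the subject ("S") at P-3 in every session. It verifies the property stated in the text, namely that each pattern contains exactly one subject, two male actors, and two female actors.

```python
# One row per session; columns are positions P-1 through P-5.
# "S" marks the subject, "M" a male actor, "F" a female actor.
PATTERNS = [
    ["M", "F", "S", "F", "M"],  # session 1
    ["F", "M", "S", "M", "F"],  # session 2
    ["M", "F", "S", "M", "F"],  # session 3
    ["F", "M", "S", "F", "M"],  # session 4
    ["M", "M", "S", "F", "F"],  # session 5
]


def is_balanced(pattern):
    """True if the pattern has 5 seats: 1 subject, 2 male and 2 female actors."""
    return (
        len(pattern) == 5
        and pattern.count("S") == 1
        and pattern.count("M") == 2
        and pattern.count("F") == 2
    )


if __name__ == "__main__":
    for session, pattern in enumerate(PATTERNS, start=1):
        status = "ok" if is_balanced(pattern) else "UNBALANCED"
        print(f"session {session}: {' '.join(pattern)}  {status}")
```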

Experiment Briefing:

Blind, Knowledge of 3D audio.

Two approaches to attaining the experiment's goal are: to test for effectiveness given no prior knowledge of the technology, or, given knowledge of the technology, to measure how effectively it is used during a session. The first approach (blind) can give clues into people's innate ability to use 3D audio cues where there were previously none. The latter approach argues that in the real world, people will have knowledge of this technology (if they are paying for it), and therefore the extent to which it is actively used should be measured.

Experiment Use: CONSTANT
Since the two approaches yield two different "types" of data, both should not be covered by a single experiment. For this initial experiment, the first approach was chosen.

Participant Position:

Any combination of azimuth, vertical, and full lateral positioning.

The 3D audio technology allows any sound source to be placed in any XYZ position at any distance. However, extreme positions have been shown to confuse listeners, especially when there is no visual stimulus to complement the auditory cues. Again, Bill Chapin has informally discussed the reaction of subjects when demonstrators position themselves above the subject.

Experiment Use: CONSTANT
To mimic normal conference meetings where everyone is physically present, a table model in which the positions of participants lie within the subject's azimuth is appropriate. Subsequent experiments may find use for 3D sounds that signify external events (known as earcons) placed outside the azimuth.

Subject Role:

Passive, active.

In any conference, there exist both passive and active participants. Passive participants tend to listen without offering any unsolicited input, while active participants can vary from offering occasional input to aggressively leading the course of the conference. Observing subjects as both passive and active participants may yield inconsistent measurements. To provide a static playing field, the experiment will use a highly choreographed topic with the subject and 4 actors. The choreography along with subject briefing should either encourage active participation (by provoking reactions and questions) or passive participation (by preventing inordinate amounts of unsolicited input).

Experiment Use: CONSTANT
For this initial experiment, active participation was desired to increase the subject's innate and explicit use of navigation with 3D audio cues.

Subject Environment:

Isolated, grouped.

Traditional conference calls can occur either by gathering multiple people in a single room to confer with a similar group elsewhere, or by bridging in isolated individuals. Prototypes are being created for group conference calls. In one such prototype, the conference room itself consists of many microphones, cameras, and speakers surrounding participants who face a video wall. Other prototypes use mannequin heads equipped with a camera and speaker to act as a proxy for the remote participant. Still other prototypes place a similarly equipped mannequin on a free-spinning mount at the center of a conference table. None of these prototypes actively spatializes sound; they merely transport real-world spatialization across high-bandwidth pipes. The effect of 3D audio is therefore governed by how well the positions of the sources' microphones match those of the recipients' speakers, so the environments for all the participants must be identical.

The technology developed by CRE uses a head-related transfer function (HRTF) to spatialize multiple input sources at locations controlled by the listener (actually by the local spatialization engine), not by the source itself. Each listener can therefore specify a different organizational model for spatialization. On the downside, HRTF spatialization requires listeners to use high-quality stereo headphones modified for full separation.
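The listener-side approach described above can be sketched as follows: each mono source is filtered with a left-ear and a right-ear impulse response for the position the listener has assigned to it, and the results are mixed into one stereo pair. This is a toy sketch and not CRE's implementation; the function names, the layout dictionary, and the hrir_table contents are hypothetical placeholders, and a real system would use measured impulse responses and a much faster convolution.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * max(len(signal) + len(impulse_response) - 1, 0)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def spatialize(sources, layout, hrir_table):
    """Mix mono sources into a (left, right) pair of sample lists.

    sources:    {name: [samples]}
    layout:     {name: position}, chosen by the listener, not by the source
    hrir_table: {position: (left_ir, right_ir)}, placeholder impulse responses
    """
    rendered = []
    for name, samples in sources.items():
        left_ir, right_ir = hrir_table[layout[name]]
        rendered.append((convolve(samples, left_ir), convolve(samples, right_ir)))
    length = max((len(ch) for pair in rendered for ch in pair), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for l_ch, r_ch in rendered:
        for i, v in enumerate(l_ch):
            left[i] += v
        for i, v in enumerate(r_ch):
            right[i] += v
    return left, right
```

Because the layout is an argument to the listener's own spatialize call, two listeners can render the same sources with different seating arrangements, which is the property the paragraph above attributes to this approach.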

Experiment Use: CONSTANT
This experiment was initiated using the CRE technology. Since listeners would need to wear separate headsets to be tracked individually, keeping the subject isolated is the more appropriate environment. It also prevents out-of-band communication (whispering, gestures, written notes, etc.) that may skew results.

Subject Iteration:

How many times the same subject is run through the experiment.

Subject iteration is used to record differences between the novel use of a technology and its possible normal use. It can also show learning curves, as well as sustained levels of satisfaction and frustration.

Experiment Use: CONSTANT
This experiment is run blind -- the subjects have no knowledge beforehand of the 3D audio cues. Therefore, single-run subjects can provide valuable information about the effectiveness of varying levels of cues without the subject making a judgement based on prior knowledge. If the initial findings prove successful, further studies can be made into multiple-use scenarios.

Conference Size:

3 tracked sources, 2 other live/taped sources, 6 digitally recorded sources.

Typical conference calls can range from several people to large groups of people, depending on their purpose. Usually, participants in large conference calls are passive, with some token participants acting as leaders for each group. Small conferences, on the other hand, tend to be highly proactive in nature, with high participation from most people.

Experiment Use: CONSTANT
To maintain a high level of interactivity between the subject and other participants, a small conference size was chosen. The CRE equipment currently limits the number of live sources to 5, of which only 3 are navigable. Also, having 4 actors can maintain the illusion of a proactive conference, yet keep the direction of the conference from straying. Having 4 actors also allows us to use a balanced number of male/female voices, as explained previously.

Experiment Setup:

The prototype allows for up to 5 live participants: one subject isolated from 4 other participants. These other 4 participants are actors providing pieces of information that the subject will need to recall later.

The experiment will use a single topic throughout all subjects. To restrict the amount of discourse in the conference, the content of the conference is highly choreographed. The topic of "telecommuting" was chosen because:
- it is highly relevant for any business,
- there are many facets to discuss within this topic,
- it can be very controversial,
- there are many differing views which can be choreographed,
- there is no subject-matter expert who can definitively decide one person's view is better than another, and
- it allows a wider range of subjects to be used in the experiment.

To run the experiment using active participation, employees (either internal or external) would be attending a meeting where their input is needed for some corporate decision on the topic above. They are expected to remain alert, awake, alive, and take notes of the conference call which could then be checked for accuracy of people and facts. Using an active participation model allows us to observe any use of 3D audio cues and possibly navigation learned during a subject's initial experience with the technology that may not have been present during a passive session.

Given the variables of "bandwidth", "navigation", and the combination of "voice attributes" previously discussed, the number of subjects for the initial study is shown in the table below:

                   Head Tracking   Manual Navigation  Static Position
44 kHz             5               <TBD>              5
22 kHz             5               <TBD>              5
Pan Delay          <TBD>           <TBD>              <TBD>
No Spatialization  NOT APPLICABLE  NOT APPLICABLE     5
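The cell table above can also be kept as data, so that the subject count is recomputed whenever a deferred cell is filled in. The sketch below is illustrative: the names CELLS, planned_subjects, and deferred_cells are hypothetical, and it assumes that NOT APPLICABLE covers both navigation columns of the No Spatialization row.

```python
TBD = None   # cell deferred until the technology is delivered
NA = "n/a"   # combination that is not applicable

# Rows are bandwidth conditions; columns are navigation conditions.
CELLS = {
    "44 kHz": {"head tracking": 5, "manual navigation": TBD, "static position": 5},
    "22 kHz": {"head tracking": 5, "manual navigation": TBD, "static position": 5},
    "pan delay": {"head tracking": TBD, "manual navigation": TBD, "static position": TBD},
    "no spatialization": {"head tracking": NA, "manual navigation": NA, "static position": 5},
}


def planned_subjects(cells):
    """Total subjects over all cells that already have a count."""
    return sum(
        count
        for row in cells.values()
        for count in row.values()
        if isinstance(count, int)
    )


def deferred_cells(cells):
    """List (bandwidth, navigation) pairs still marked TBD."""
    return [
        (bandwidth, navigation)
        for bandwidth, row in cells.items()
        for navigation, count in row.items()
        if count is TBD
    ]


if __name__ == "__main__":
    print("subjects planned:", planned_subjects(CELLS))
    print("cells deferred:", len(deferred_cells(CELLS)))
```

With the table as it stands, five cells of 5 subjects each give 25 subjects for the initial study, and five cells remain deferred. A cell size of 5 allows one subject per voice pattern described earlier.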

Each subject will be videotaped, but the soundtrack input will be taken from the subject's headset, thereby recording the subject's spatialized view of the conference call.


Effective measurement of this experiment comes from a combination of task related tests, conversational event observations, visual observations and open-ended as well as evaluative subjective questions. There are three basic areas to explore in this experiment for first-time users of this technology:

Is the 3D effect noticeable?

Task Related:
- Can subject recall specific locations of participants in conference?

- Did subject notice 3D effect?

Conversational Events:
- How often did subject use 3D audio cues (e.g. turn head towards a source)?

Did 3D audio cues facilitate meeting results?

Task Related:
- Can subject recall views of participants at given locations?
- Did subject associate participants with locations?

- Did subject feel 3D audio helped distinguish participants?
- How would subject like to be able to organize participant location?

Conversational Events:
- How often did subject ask for participant names?
- How often did subject prefix comments with their own name?

Do 3D audio cues provide an added value to conference calls?

- Would subject use it again?
- Would subject pay for this feature on a per use basis?
- How much would subject be willing to pay for this feature?