SEPTEMBER/OCTOBER 2000
change the underlying hardware.3
     The handheld device’s user
interface will combine voice, text,
and graphics—the different media
will complement each other. Take
for example the MiPad, the next-
generation PDA under develop-
ment at Microsoft.4 It combines a
small color screen with familiar
icons (e-mail, to do list, address
book, and calendar), hyperlinks,
and forms with stylus input and
speech input and output. Instead of
having to type long text messages
on a small touchscreen, you can
tap on the entry form and simply
say the words—the device will
display the words you speak. Tapping
on a specific area of the screen triggers
the speech-recognition engine and tells it
which field you are filling in, which narrows
down the list of recognizable words
and improves the language understanding
rate. Conversely, some long text messages
can be spoken rather than displayed to you.
For example, as you point to a location on a
map, the device will tell you about the closest
restaurant without cluttering the
small map with unnecessary text labels. The device
performs speech recognition and natural
language understanding on dedicated network
servers; it can understand
complete sentences in the context of
an ongoing dialogue, which goes
beyond simply recognizing short
phrases. For example, within the context of
the Oxygen project, the MIT Spoken Lan-
guage Systems Group is developing sys-
tems that engage in a conversation with
users in domains such as weather, flight,
and tourist information.5
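
The tap-and-talk idea described above can be illustrated with a toy sketch in Python. It is not the MiPad implementation: the field names, the vocabularies, and the use of string similarity in place of acoustic scoring are all assumptions made to keep the example runnable. It shows only how knowing which field was tapped shrinks the candidate list and can change the result.

```python
from difflib import SequenceMatcher

# Hypothetical per-field vocabularies; a real device would draw these
# from its address book, calendar, and so on.
FIELD_VOCABULARY = {
    "to": ["mandy", "bob"],
    "day": ["monday", "friday"],
}
ALL_WORDS = [word for words in FIELD_VOCABULARY.values() for word in words]


def recognize(heard, tapped_field=None):
    """Return the candidate word most similar to the noisy input.

    String similarity stands in for acoustic scoring (an assumption).
    When a field has been tapped, only that field's words are candidates;
    otherwise every known word competes.
    """
    candidates = FIELD_VOCABULARY.get(tapped_field, ALL_WORDS)
    return max(candidates,
               key=lambda word: SequenceMatcher(None, heard, word).ratio())
```

With these made-up vocabularies, the noisy input "mondy" resolves to a day of the week when no field is selected, but to a contact name once the user taps the "to" field.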
From HyperText to HyperLanguage
     Small handheld devices will soon be
ubiquitous. Although users currently
browse the Web mostly offline, they will
browse online much more frequently in the
future.
     When necessary and appropriate,
speech input and output will complement the
small display. The next version of HTML must
combine several features needed to support
small display sizes and multimodal user
interfaces. Currently, there are several lan-
guages available, each one specialized for a
particular modality—for example, WML
uses the concept of a deck of cards to
model interaction over a sequence of very
small pages. The VoiceXML standard
describes the dialog flow between the user
and the server using speech input and out-
put. A markup language for voice commu-
nication must model the various steps in a
dialog, such as questions, answers, correc-
tions, and restarts. Macromedia’s Flash and
the SMIL language can model interactions
that involve the simultaneous use of text,
graphics, audio, animation, and video.
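
The dialog steps a voice markup language must model (questions, answers, corrections, and restarts) can be sketched as a small state machine. The Python below is an illustration only, not VoiceXML: the class name, the command phrases "go back" and "start over", and the form-filling structure are assumptions chosen for brevity.

```python
class Dialog:
    """Form-filling dialog that asks one question per field, in order."""

    def __init__(self, questions):
        self.questions = questions  # list of (field, prompt) pairs
        self.answers = {}
        self.position = 0

    def prompt(self):
        """Return the next question, or None when every field is filled."""
        if self.position < len(self.questions):
            return self.questions[self.position][1]
        return None

    def hear(self, utterance):
        """Process one user turn and return the next prompt."""
        if utterance == "start over":              # restart
            self.answers.clear()
            self.position = 0
        elif utterance == "go back":               # correction
            if self.position > 0:
                self.position -= 1
                field = self.questions[self.position][0]
                self.answers.pop(field, None)
        elif self.position < len(self.questions):  # answer
            field = self.questions[self.position][0]
            self.answers[field] = utterance
            self.position += 1
        return self.prompt()
```

A caller would create `Dialog([("city", "Which city?"), ("day", "Which day?")])`, speak or display each prompt, and pass every recognized utterance to `hear`. A correction steps back one question and discards that answer; a restart clears everything.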
We are about to face changes across the board in wireless Internet access: infra-
structure, devices, and user interfaces will
all change dramatically. However, if there
isn’t a common “glue” between devices
and servers connected to the wireless
Internet, the growth will be slower than
expected.
     One of HTML’s most important contribu-
tions has been its ability to provide a glue—
a common language for rendering text and
graphics content across different operating
systems, different file systems, different
display devices, and different
languages. The fact that all servers and
browsers use a common markup language
is one factor in the global spread of Internet
usage. HTML’s successor should be able to
keep up with the multiplication of display
devices of different types and capabilities.
If we can clearly organize the language into
classes of features and embrace, rather than
isolate, different user interface modalities,
we will again witness rapid growth, similar
to that of the Web’s early days. In the next generation
of handheld devices, the user will be
able to tap anywhere on a screen and speak
to the device, and the machine
should be able to either use the
display or speak back. We could
develop Web applications faster
and improve consistency across
modalities if a single markup
language could model the
abstract interaction between a
group of Web page elements and
the user and model how the various
user interface elements
implement the interaction. This
markup language should be able
to combine all the features available
in all the various languages,
including CHTML, VoiceXML,
and WML. In a markup
language, each page element or
tag might have one or more events that
should trigger actions programmable with a
scripting language such as JavaScript or
ECMAScript.
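
One way to picture a single abstract element that several modalities implement, and whose events trigger scripted actions, is the Python sketch below. Everything in it is hypothetical: the `Element` class, the "tap" and "speech" event names, and the two rendering functions are illustrative assumptions, not part of any existing markup language.

```python
class Element:
    """One abstract page element with scriptable event handlers."""

    def __init__(self, name, label):
        self.name = name
        self.label = label
        self.handlers = {}  # event name -> list of callables

    def on(self, event, handler):
        """Register a handler, much as a script would attach to a tag."""
        self.handlers.setdefault(event, []).append(handler)

    def fire(self, event, value=None):
        """Run every handler for the event and return their results."""
        return [handler(self, value)
                for handler in self.handlers.get(event, [])]


def render_visual(element):
    """Small-screen rendering: a labeled entry field."""
    return f"[{element.label}: ______]"


def render_speech(element):
    """Voice rendering: a spoken prompt for the same element."""
    return f"Please say the {element.label.lower()}."
```

The same `Element("city", "City")` can be drawn on a screen by `render_visual` or spoken by `render_speech`, while handlers registered with `on` respond to whichever events the device is able to raise.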
     Inevitably, different devices will have
different user interfaces. This means that
not every device can implement every element
of the next HTML; each device might
implement only a subset of the markup tags.
Each specific subset of the markup lan-
guage should define a class of devices and
include the list of tags (page elements) and
events each tag supports.
     If Web application developers know in
advance which tags and events are avail-
able in each class, they will be able to
rapidly develop Web applications targeting
specific classes of devices.  
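
The notion of device classes, each defined by a subset of tags and the events those tags support, can be sketched as a lookup table plus a compatibility check. The class names, tag names, and event names in this Python example are invented for illustration; no standard defines them.

```python
# Hypothetical device classes: each maps a supported tag to the set of
# events that tag can raise on that class of device.
DEVICE_CLASSES = {
    "voice-only": {"prompt": {"speech"}, "field": {"speech"}},
    "small-screen": {"text": set(), "field": {"tap"}, "link": {"tap"}},
    "multimodal": {
        "text": set(),
        "prompt": {"speech"},
        "field": {"tap", "speech"},
        "link": {"tap"},
    },
}


def unsupported(page, device_class):
    """List the (tag, event) uses in a page that the class cannot handle.

    A page is a list of (tag, event) pairs; an event of None means the
    tag is used without any event handler.
    """
    profile = DEVICE_CLASSES[device_class]
    problems = []
    for tag, event in page:
        if tag not in profile:
            problems.append((tag, event))
        elif event is not None and event not in profile[tag]:
            problems.append((tag, event))
    return problems


def target_classes(page):
    """Return, in sorted order, the classes that support the whole page."""
    return sorted(name for name in DEVICE_CLASSES
                  if not unsupported(page, name))
```

A developer could run `target_classes` on a page before deployment to see which classes it already fits, and use `unsupported` to find the tags or events that would need an alternative for the remaining classes.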
References
1. S. Prandini and D. Sims, “Wireless Carrier
Technology Roadmap,” www.oreillynet.com/pub/a/308
(current Aug. 2000).
2. The Wireless Markup Language Tutorial;
www.wirelessdevnet.com/training/WAP/WML.html
(current Aug. 2000).
3. J. Guttag, “Communications Chameleons,”
Scientific American, Vol. 281, No. 2, Aug.
1999, pp. 56–57; www.sciam.com/1999/0899issue/0899guttag.html
(current Aug. 2000).
4. X.D. Huang et al., “MIPAD: A Next Generation
PDA Prototype,” to appear in Proc. Int’l
Conf. Spoken Language Processing, 2000.
5. V. Zue, “Talking with Your Computer,”
Scientific American, Vol. 281, No. 2, Aug.
1999, pp. 56–57; www.sciam.com/1999/0899issue/0899zue.html
(current Aug. 2000).