by Brent Baccala
baccala@freesoft.org
June, 2002
More recently, Uniform Resource Locators (URLs) have emerged as the dominant means for users to identify web resources. The distinction is not merely one of introducing a new protocol with new terminology, either. URLs are used to name blocks of data, not network devices. Especially with the advent of caching, it's now clear that a web browser may not have to make any network connections at all in order to retrieve and display a web page. "Retrieving" a URL differs significantly from opening an HTTP session, since an HTTP session implies a network connection to a named device, while accessing a URL implies only that its associated data (stored, perhaps, on a local disk) is made available. HTTP, SMTP, ssh, and other TCP-based protocols are inherently connection-oriented, while the URL is inherently data-oriented.
The Internet is moving away from a connection-oriented model and becoming more data-oriented. Since the original Internet design was modeled, at least loosely, after a telephone system, all of its original protocols were connection-oriented. Increasingly, we're becoming aware that often a user is not interested in connecting to such-and-such a computer, but rather in retrieving a specific piece of data. Since such operations are so common, Internet architects need to recognize the distinction between connection-oriented and data-oriented operations and design networks to support both well. Data-oriented models will not replace connection-oriented models; sometimes, you'll still want to make the telephone call. Rather, the pros and cons of each need to be understood, so both can be incorporated into the Internet of the 21st century.
To understand the emergence of data-oriented networking, it is useful to consider the historical development of the Internet. Initially, the driving application for what became the Internet was email, allowing virtually instantaneous communications over large distances; FTP and TELNET were second and third. FTP provided file transfer and a rudimentary publication system; TELNET extended the 1970s command line interface over the network, letting people "log in" over the net, thus allowing remote use and management of computer systems.
Even in these early years of the Internet, the network was becoming more data-oriented than a cursory examination of its protocols would suggest. FTP archive sites, such as uunet and wuarchive, began assembling collections of useful data, including programs, protocol documents, and mailing list archives in central locations. Other sites began mirroring the archives, so that retrieving a particular program, for example, did not require a connection to a centralized server for the program, but only a connection to a convenient mirror site. The practice continues to this day. Of course, accessing the mirror sites required using the connection-oriented protocols, and the process of finding a mirror site or archive that contained the particular program you wanted remained largely a manual process. It still does.
A significant change occurred during the 1980s - the appearance of graphical user interfaces (GUIs) in personal computers by the end of the decade. In the early to mid 90s, the world wide web extended the GUI over the network, much as TELNET had extended the command line interface over the net. More than anything else, the web represents a global GUI, a means of providing the commonly accepted point-and-click interface to users around the world.
It is impossible to overstate the impact of the web. The GUI was a critical technology that made computers more accessible to the average person. No longer did you need to type cryptic instructions at a command prompt. To open a file, represented by a colorful icon, just move a pointer to it and click. Yet until the web, you still needed to use the old command-line interface to use the network. Your desktop PC might use a GUI, but connecting to another computer generally meant a venture into TELNET or FTP. The web extended the GUI metaphor over the network. Instead of learning FTP commands to retrieve a file, you could just browse to a web site and click on an icon.
Other technologies could have provided a network GUI, but not as well as HTML and HTTP. X Windows certainly was designed specifically with network GUI applications in mind, but provided so little security that using it to "browse" potentially untrusted sites was never realistic. AT&T's Virtual Network Computing (VNC) is similar to X Windows, and is designed so that its effects can be confined to a single window. With some extensions, it could be used as the basis for a network GUI. However, both X Windows and VNC share a single common major flaw - they are connection-oriented protocols that presuppose a real-time link between client and server. The user types on the keyboard or clicks on a link, then the client transmits this input to the server, which processes the input and sends new information to the client, which redraws the screen. X Windows has never been widely used over the global Internet, because the bandwidth and delay requirements for interactive operation are more stringent than the network can typically provide. VNC is very useful for using GUI systems remotely, but still doesn't provide the performance of local software.
The present HTML/HTTP-based design of the web does have one overwhelming advantage over X Windows / VNC, however. The web is data-oriented, not connection-oriented, or is at least more so than conventional protocols. A web page is completely defined by a block of HTML, which is downloaded in a single operation. Highlighting of links, typing into fill-in forms, scrolling - all are handled locally by the client. Rather than requiring a connection to remain open to communicate mouse and keyboard events back to the server, the entire behavior of the page is described in the HTML.
The advent of web caches changes this paradigm subtly, but significantly. In a cache environment, the primitive operation in displaying a web page is no longer an end-to-end connection to the web server, but the delivery of a named block of data, specifically the HTML source code of a web page, identified by its URL. The presence of a particular DNS name in the URL does not imply that a connection will be made to that host to complete the request. If a local cache has a copy of the URL, it will simply be delivered, without any wide area operations. Only if the required data is missing from the local caches will network connections be opened to retrieve the data.
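The primitive operation described here, delivering a named block of data and opening a connection only on a miss, can be sketched in a few lines of Python. This is a minimal illustration, not a description of any real cache: the class name UrlCache, the fake_origin function, and the origin_requests log are all assumptions introduced for the example, and expiry and validation are ignored entirely.

```python
from typing import Callable, Dict, List


class UrlCache:
    """Deliver blocks of data by URL, contacting the origin only on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin        # hypothetical wide-area fetch
        self._store: Dict[str, bytes] = {}     # URL -> cached block of data
        self.origin_requests: List[str] = []   # log of wide-area operations

    def get(self, url: str) -> bytes:
        if url in self._store:
            # A local copy exists: no connection to the named host is made.
            return self._store[url]
        self.origin_requests.append(url)
        data = self._fetch(url)
        self._store[url] = data
        return data


def fake_origin(url: str) -> bytes:
    # Stand-in for a real HTTP request to the host named in the URL.
    return ("<html>page for " + url + "</html>").encode("ascii")


cache = UrlCache(fake_origin)
first = cache.get("http://www.freesoft.org/CIE/")
second = cache.get("http://www.freesoft.org/CIE/")   # served from the cache
```

The second call returns the same bytes as the first, but only one entry appears in origin_requests; the DNS name in the URL identified the data, not a connection.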
Experience with web caches demonstrates that data-oriented networks provide several benefits. First, the bandwidth requirements of a heavily cached, data-oriented network are much lower than those of a connection-oriented network. Connection-oriented protocols such as X Windows, VNC, and TELNET presuppose a real-time connection between client and server, and in fact could not operate without such a connection, since the protocols do not specify how various user events, such as keyclicks, should be handled. All the protocols do is relay the events across the network, where the server decides how to handle them, then sends new information back to the client in response. A data-oriented network, which specifies the entire behavior of the web page in a block of HTML, does not require a real-time connection to the server. Once the data describing a web page has been retrieved, the connection can be severed and the user can browse through the page, scrolling, filling out forms, watching animations, all without any active network connection. Only when the user moves to another web page is a connection required to retrieve the data describing the new page. Furthermore, since the data describing the pages is completely self-contained, no connection to the original server is required at all if a copy of the web page can be found. A copy, stored anywhere on the network, works as well as the original. As the network becomes more data-oriented, fewer and briefer connections are required to carry out various operations, reducing overall network load.
A data-oriented network is also more resilient to failures and partitions than a connection-oriented network. Consider the possibility of a major network failure, such as the hypothetical nuclear strike that originally motivated the Defense Department to build the packet-based network that evolved into the Internet. Modern routing protocols would probably do a fairly good job of rerouting connections around failed switching nodes, probably in a matter of minutes, but what if the destination server itself were destroyed? The connection would be lost, and no clear fallback path presents itself. The obvious solution is to have backup copies of the server's data stored in other locations, but creating and then finding these backups is currently done by hand. Existing routing protocols can reroute connections, but are woefully inadequate for rerouting data.
A more mundane, but far more common scenario is the partitioned network. Simply operating in a remote area may dictate long periods of operation without network connectivity. In such an environment, it'd be convenient to drop any information that might be needed on a set of CD-ROMs. That works fine until the first search page comes up that connects to a specialized program on the web server, or a CGI script that presents dynamic content, or an image map. Solutions have been developed to put web sites on CD-ROMs - none of them standard, most of them incomplete. A more data-oriented design, that didn't depend on connections to a server, would be far better suited to such situations.
HTML, the workhorse protocol of the web, was never designed with use as a network GUI in mind, even though this is the role it has evolved into. It's the HyperText Markup Language (HTML), and hypertext is not a GUI. Hypertext is text that includes hyperlinks. Perhaps we can expand the definition somewhat into a "hyperdocument" that can include colors, diagrams, pictures, and even animation. A GUI is much more than a hyperdocument, however. A GUI is a complete user interface that provides the human front end to a program. Not only can it include dialog boxes, pull down menus and complex mouse interactions, but more than anything else it provides the interface to a program, which could perform any arbitrary task, and is thus not just a document. The program could be a document browser, a document editor, a remote login client, a language translator, a simulation package, anything. What was pioneered by Xerox PARC, deployed by Apple Lisa, marketed by Macintosh and brought with such stunning success to the masses by Microsoft was not hypertext, but the GUI. The GUI is what we are trying to extend across the network, not hypertext, and thus HTML just isn't very well suited for the task.
Since it wasn't designed to provide a network GUI, HTML doesn't provide the right primitives for the task it has been asked to perform, and thus we've seen a long series of alterations and enhancements. First there was HTML 2, then HTML 3, then HTML 4, now HTML with Cascading Style Sheets, soon XHTML, plus Java applets, JavaScript, CGI scripts, servlets, etc, etc... The fact that HTML has had to change so much, and that the changes require network-wide software updates, is a warning sign that the protocol is poorly designed. The problem is that HTML has been conscripted as a network GUI, though, to this day, it has never been clearly designed with this goal in mind. Part of what is needed is a replacement for HTML specifically designed to act as a network GUI.
In addition, one of the great challenges to a data-oriented model is dynamic pages. Presently, web caching protocols provide means for including meta information, in either HTTP or HTML, that inhibits caching on dynamic pages, and thus forces a connection back to the origin server. While this works, it breaks the data-oriented metaphor we'd like to transition towards. To maintain the flexibility of dynamic content in a data-oriented network, we need to lose the end-to-end connection requirement and this seems to imply caching the programs that generate the dynamic web pages. While cryptographic techniques for verifying the integrity of data have been developed and are increasingly widely deployed, no techniques are known for verifying the integrity of program execution on an untrusted host, such as a web cache. Barring a technological breakthrough, it seems impossible for a cache to reliably run the programs required to generate dynamic content. The only remaining solution is to cache the programs themselves (in the form of data), and let the clients run the programs and generate the dynamic content themselves. Thus, another part of what's needed is a standard for transporting and storing programs in the form of data.
An important change in moving to a more data-oriented network would be to replace HTML with a standard specifically designed to provide a data-oriented network GUI. The features of this new protocol:
This misses the point. We're not trying to build a Java-based HTML web browser that would simply achieve cross-platform operability. The goal is to build a web browser that, as its primary metaphor, presents arbitrary Java-based GUIs to the user. HTML could be displayed using a Java HTML browser. The difference is that the web site designer controls the GUI by providing the Java engine for the client to use for displaying any particular page. Switching to a different web site (or web page) might load a different GUI for interacting with that site's HTML, or XML, or whatever. Unlike Andreessen's "Javagator", the choice of GUI is under control of the web server, not tied into a Java/HTML web browser.
For example, consider a web site that wants to allow users to edit its HTML pages in a controlled way. Currently, you have a few choices, none completely satisfactory. First, you could put your HTML in an HTML textbox, and allow the user to edit it directly, clicking a submit button to commit it and see what the page will actually look like. Alternatively, you could allow the HTML to be edited with Netscape Composer or some third party HTML editor on the client, accepting the HTML back from the client in a POST operation. This provides the server very little control over exactly what the user can and can't do to the page. Since parts of the page might be automatically generated, this isn't satisfactory, nor do we really know much about this unspecified "third party editor". On the other hand, with a Java browser, the web site could simply provide a modified HTML engine that would allow the user to edit the page, in a manner completely specified by the web designer, prohibiting modifications to parts of the page automatically generated, and allowing special cases, such as spreadsheet tables within the page, to be handled specially.
Another advantage to this proposal is that it provides a solution to a problem plaguing XML - how do you actually display to the user the information you've encoded in XML? This is left glaringly unaddressed by the XML standards, the solution seeming to be that you either use a custom application capable of manipulating the particular XML data structures, or present the data in two different formats - XHTML for humans and XML for machines. A Java-based web browser addresses this problem. You ship only one format - XML - along with a Java application that parses and presents it to the user.
On the other hand, let's keep Andreessen's criticism in mind. Java may not be suitable for such a protocol, for either technical or political reasons. The speed issues seem to be largely addressed by the current generation of Just-In-Time (JIT) Java runtimes, but whatever the standard is, it should be an RFC-published, IETF standard-track protocol, and if the intellectual property issues around Java preclude this, then something else needs to replace it. Alternatives include Parrot, the as yet unfinished Perl 6 runtime, and Microsoft's .NET architecture, based around a virtual machine architecture recently adopted as ECMA standard ECMA-335.
PDF also deserves consideration. Though it lacks the generality to provide a network GUI, its presentation handling is vastly superior to HTML's, giving the document author complete control over page layout, and allowing the user to zoom the document to any size for easy viewing. It is also easier to render than HTML, since its page layout is more straightforward for the browser to understand.
A definite metaphor shift is required. Rather than viewing HTML as the primary standard defining the web, the primary standard must become Java or something like it, that provides full programmability. Browsing a web page becomes downloading and running the code that defines that page's behavior, rather than downloading and displaying HTML, that might contain an embedded applet.
Backwards compatibility can be provided along the lines of HotJava, Sun's proprietary Java-based web browser, which implements HTML in Java. To display an HTML page, Java classes are loaded which parse the HTML and display it within a Java application. The browser provides little more than a Java runtime that can download arbitrary Java and run it in a controlled environment. Initially, 99% of the pages would be HTML, viewed using a standard (and cached) HTML engine coded in Java.
Notwithstanding the creeping featurism present in Java, adopting this approach would avoid the creeping featurism so grossly apparent in web browsers. Even the casual observer will note that mail, news, and teleconferencing are simply bloat that results in multi-megabyte "kitchen sink" browsers. Will the next release of Netscape, one might ask, contain an Emacs mail editor with its embedded LISP dialect? And if not, why not? Only because the majority of users wouldn't use Emacs to edit their mail? Why should we all be forced to use one type of email browser? Why should we have Netscape's email browser packaged into our web browser if we don't use it? Like the constant versioning of HTML, the sheer size of modern browsers is a warning sign that the web architecture is fundamentally flawed. A careful attempt to standardize "network Java" would hopefully result in smaller, more powerful browsers that don't have to be upgraded every time W3C revs HTML; you simply update the Java GUI on those particular sites that are taking advantage of the newer features.
Another tremendous advantage is the increased flexibility provided to web designers. HTML took a big step in this direction with Cascading Style Sheets, but CSS doesn't provide the power of a full GUI. For example, if a web page designer wanted to, he could publish an HTML page with a custom Java handler that allowed certain parts of the HTML text to be selectively edited by the user. This simply can't be done using CSS.
Network-deliverable, data-oriented GUIs aren't a panacea, of course. For starters, one of the advantages of the present model is that all web pages have more or less the same behavior (since they are all viewed with the same GUI). The "Back" and "Forward" buttons are always in the same place, the history function always works the same way, you click on a link and basically the same thing happens as happens on any other page. Providing the web designer with the ability to load a custom GUI changes all that. Standards need to be developed for finding and respecting user preferences concerning the appearance of toolbars, the sizing of fonts, the operation of links. The maturing Java standards have already come a long way towards addressing issues such as drag-and-drop that would have to be effectively implemented in any network GUI.
Hurdles need to be crossed before we can reach a point where web designers can depend on Java-specific features. One possibility would be to migrate by presenting newer web pages to older browsers using a Java applet embedded in the web page. Performance might suffer, but clever design would hopefully make it tolerable. For starters, consider that the web page data presented to the applet need not be the source HTML, but could be a processed version with page layout already done. Newer, Java-only browsers should be leaner and faster.
The caching model needs work, too, and this is an area where much work is being actively pursued. Hot topics currently being explored include locating caches where data is stored, authenticating the validity of the data in the absence of a connection to the originating server, and establishing automated policies for caching and building replica or mirror sites.
A significant hurdle for building web replicas is the lack of a standard to deliver the executable components that underlie dynamic content. Without such a standard, there is simply no way to automatically replicate a web site, unless its content is completely static. If Java is to become that standard, then Java needs to provide some kind of CGI-type capability.
Java "servlets" are a step in the right direction, since they provide a CGI-type capability that enables a web cache to present dynamic content without a connection to the origin server. Since they are Java-based, they provide solutions to the security issues that surround something like Perl. When mirroring another site, you don't want to simply download their Perl scripts and allow them to run on the replica, because they would have basically no security restrictions there. Java servlets address this issue. They aren't a complete solution, however, because they are tied to the HTTP/HTML model. A more general model would be a variety of applets/servlets/classes interacting to present a GUI to the user, of which a servlet passing HTML via HTTP to an HTML browser is but one possible combination.
Part of the Java servlet specification is WAR (Web Application Archive), an extension to JAR that provides Java servlets, HTML and JSP pages, and XML metadata all packaged up into a single archive file to provide a "web application". In the current implementation, the server administrator "installs" the WAR at a particular URL by loading it onto a Java servlet-enabled web server. If the WAR format were altered slightly to include, perhaps in the XML metadata, a "master" URL, and the servlet-enabled web server were to function more as a proxy, handling requests locally if it possessed a valid WAR, passing them along otherwise, this would be a big step in the right direction. To get away from having to trust a proxy to execute WAR content, though, the client has to execute the content itself. While the present architecture can move in this direction, eventually by putting WAR-enabled caches on local clients, the current client/server model is fundamentally limited. Ultimately, servers and caches should do nothing but hand out data, and the responsibility for executing it should fall exclusively to the client, not the cache.
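To make the "master URL" idea concrete, here is a rough Python sketch of an archive that records the URL it is authoritative for, together with the proxy's decision to answer locally or pass the request along. Everything specific in it is an assumption made for illustration: the metadata entry name WEB-INF/master.xml and its master-url element are hypothetical and not part of the servlet specification, and the functions build_war, master_url_of, and serve are invented here. Only static entries are served; executing servlets is not modeled.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict, Optional

# Hypothetical metadata entry; the real WAR format defines no such file.
METADATA_NAME = "WEB-INF/master.xml"


def build_war(master_url: str, pages: Dict[str, bytes]) -> bytes:
    """Pack pages plus a hypothetical master-URL record into a ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as war:
        root = ET.Element("war-metadata")
        ET.SubElement(root, "master-url").text = master_url
        war.writestr(METADATA_NAME, ET.tostring(root))
        for path, body in pages.items():
            war.writestr(path, body)
    return buf.getvalue()


def master_url_of(war_bytes: bytes) -> str:
    """Read the master URL back out of the archive's metadata entry."""
    with zipfile.ZipFile(io.BytesIO(war_bytes)) as war:
        root = ET.fromstring(war.read(METADATA_NAME))
    text = root.findtext("master-url")
    if text is None:
        raise ValueError("archive has no master-url element")
    return text


def serve(url: str, war_bytes: bytes) -> Optional[bytes]:
    """Return the page from the local WAR, or None to pass the request along."""
    master = master_url_of(war_bytes)
    if not url.startswith(master):
        return None                      # not covered by this archive
    path = url[len(master):] or "index.html"
    with zipfile.ZipFile(io.BytesIO(war_bytes)) as war:
        if path not in war.namelist():
            return None                  # covered, but no such entry
        return war.read(path)
```

A proxy holding a WAR built with master URL http://www.freesoft.org/CIE/ would answer requests under that prefix from the archive and forward everything else.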
Security and authentication are major concerns, especially in a cached environment. In this case, some protocols exist to provide authentication services, yet have many outstanding issues. Some are not widely deployed - DNS key services, for example. The most widely deployed solution - X.509 certificates - has been priced and managed into a realm where only e-business sites can realistically justify their costs. Web security can't be just for those who can and will shell out hundreds of dollars for certificates that keep expiring. In a heavily cached environment, it's easier than ever to spoof somebody's URLs, and X.509-based authentication needs to be in place for 99% of the net's web sites, not 1% of them. Standards exist for storing public keys in DNS (KEY and CERT resource records), which can be used to validate signed JAR/WAR files.
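The check a client would make before trusting a cached copy can be sketched as follows. Note the substitution: Python's standard library has no public-key signature verification, so this sketch compares a SHA-256 digest against a value assumed to have arrived over some authenticated channel (for instance, alongside a key published in DNS). It is a simplified stand-in for validating a signed JAR/WAR, not the real thing, and the function names are invented for the example.

```python
import hashlib
import hmac


def digest_of(data: bytes) -> str:
    """Hex SHA-256 digest of a block of data."""
    return hashlib.sha256(data).hexdigest()


def accept_cached_copy(data: bytes, trusted_digest: str) -> bool:
    """Accept data from an untrusted cache only if it matches the trusted digest.

    trusted_digest is assumed to come from an authenticated source; how it
    gets to the client is outside the scope of this sketch.
    """
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(digest_of(data), trusted_digest)
```

With a check like this in place, it no longer matters which cache supplied the bytes; a modified copy is simply rejected.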
Mainstream caching research also tends to largely ignore the most successful example of a cached network service - Napster and its various spinoffs, most notably Gnutella, which seem to go by the buzzword peer-to-peer file sharing, or P2P. For example, RFC 3040, "Internet Web Replication and Caching Taxonomy", a January 2001 document discussing "protocols, both open and proprietary, employed in web replication and caching today," never mentions the word "Napster". Since peer-to-peer was designed to share music and not HTML documents, the oversight can be forgiven, but this point needs to be made and made strongly - Napster, Gnutella, and friends _are_ caching services, and by far the most successful ones built to date. The vast majority of the information available through Gnutella, for example, is simply copies of information obtained elsewhere.
Peer-to-peer has evolved to fill a different role in the caching field than offerings like Squid. Squid caches single, small files on-the-fly, without much manual configuration, building its cache in a fashion very similar to a MAC bridge. P2P caches large files that have been manually selected for caching, and relies on the user to make the tricky decision of what is to be saved and what can be discarded. Several problems are apparent in moving the Squid model towards the Napster model. First, the sheer volume of URLs seems to preclude any Napster-based model where the URL was the primary index. The amount of traffic generated by searching for a single URL would be astronomically prohibitive.
One solution to this problem is to build a peer-to-peer caching model based around WARs, not URLs. Since a WAR is an archived collection of URLs, presumably there would be far fewer WARs than URLs. Searching for a URL would return hits for the WAR that includes that URL. Thus, a peer-to-peer cache that stored WARs as single entities would allow all the HTML, all the dynamic backends, all the content for a single DNS name to be packaged up into a small group of cached WARs.
For example, freesoft.org's most popular offering is the Internet Encyclopedia, homed at http://www.freesoft.org/CIE/, which includes the full text of all the RFCs, at http://www.freesoft.org/CIE/RFC/. A logical caching scheme would be to separate the Encyclopedia from the rest of the website, and separate the RFCs from both. This could be accomplished by building separate WAR files for the Encyclopedia and for the RFCs. A nice touch would be to move everything else on the site, except the home page and its related content, to a separate WAR, and thus offer a nice, compact WAR for the www.freesoft.org home page. Now, anyone who wants to browse this site would go first to the home page and fetch the http://www.freesoft.org/ WAR. If the user then clicked on the icon for the Internet Encyclopedia, the client or its local cache would fetch from a peer-to-peer cache the http://www.freesoft.org/CIE/ WAR, with the entire content of the Internet Encyclopedia packaged up into a single file.
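One simple way to resolve a URL to the WAR that contains it is a longest-prefix match against the set of known WAR roots, sketched below. The matching rule and the function name war_for_url are my assumptions; a real peer-to-peer index would need something more elaborate, and a prefix rule alone cannot express the "everything else on the site" WAR mentioned above.

```python
from typing import Iterable, Optional


def war_for_url(url: str, war_roots: Iterable[str]) -> Optional[str]:
    """Pick the most specific WAR root that is a prefix of the URL."""
    matches = [root for root in war_roots if url.startswith(root)]
    return max(matches, key=len) if matches else None


# The three prefix-defined WARs from the freesoft.org example.
WAR_ROOTS = [
    "http://www.freesoft.org/",
    "http://www.freesoft.org/CIE/",
    "http://www.freesoft.org/CIE/RFC/",
]
```

A search for any URL under http://www.freesoft.org/CIE/RFC/ resolves to the RFC archive rather than to the Encyclopedia or home page WARs, so the index needs one entry per WAR rather than one per URL.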
For more rapid response time, the Range: header could be used to retrieve first the WAR file's table of contents, then the compressed data of the particular URL, resulting in a retrieval time comparable to straight HTTP, ignoring the search time required to find the cache item to begin with and the compilation/startup time of any dynamic code (both of which may be significant). Of course, in addition to such a "partial retrieval", a peer-to-peer file sharing service could do a "full retrieval", obtaining the entire packaged WAR and begin sharing it with others. The decision of how to choose between partial and full retrieval is left "for further study", in other words, the user has to make those decisions manually until we figure it out better. Napster has demonstrated that letting the users make caching decisions manually is workable, so long as the cache items are reasonably sized (not too large or too small) and well labeled.
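Since a WAR is a ZIP-format archive, the "partial retrieval" can be sketched with Python's zipfile module reading through a file-like object that fetches only the byte ranges it is asked for. The RangeReader class and the fetch_range callback are assumptions made for this example; fetch_range stands in for an HTTP request carrying a Range: header and here simply slices an in-memory copy while counting the bytes that would have crossed the network.

```python
import io
import os
import zipfile
from typing import Callable, Tuple


class RangeReader(io.RawIOBase):
    """Seekable read-only file whose reads are served by byte-range fetches."""

    def __init__(self, size: int, fetch_range: Callable[[int, int], bytes]) -> None:
        super().__init__()
        self._size = size
        self._fetch = fetch_range    # callable(start, end_exclusive) -> bytes
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")
        self._pos = base + offset
        return self._pos

    def readinto(self, buf) -> int:
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0                 # at or past end of file
        data = self._fetch(self._pos, end)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


def build_demo_war() -> bytes:
    """Build a small archive: one bulky entry and one small page."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as war:
        war.writestr("big/filler.bin", os.urandom(100_000))  # incompressible bulk
        war.writestr("RFC/index.html", b"<html>RFC index</html>")
    return buf.getvalue()


def partial_retrieve(war_bytes: bytes, member: str) -> Tuple[bytes, int]:
    """Read one member and report how many bytes were transferred to do it."""
    transferred = 0

    def fetch_range(start: int, end: int) -> bytes:
        # Stand-in for an HTTP request with 'Range: bytes=start-(end-1)'.
        nonlocal transferred
        transferred += end - start
        return war_bytes[start:end]

    with zipfile.ZipFile(RangeReader(len(war_bytes), fetch_range)) as war:
        content = war.read(member)
    return content, transferred
```

Calling partial_retrieve(build_demo_war(), "RFC/index.html") reads the directory at the end of the archive and then the one small entry, transferring a small fraction of the archive's total size. A "full retrieval" would simply fetch the whole range once and keep it for sharing.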
In summary, I recommend the following steps:
Hypertext Markup Language (HTML) / Extensible Markup Language (XML)
RFC 1737 - Functional Requirements for Uniform Resource Names, a short document notable for its high-level overview of URN requirements
RFC 1738 - Uniform Resource Locators, a technical document of more importance to programmers than architects
RFC 2536 - DSA KEYs and SIGs in the Domain Name System (DNS)
RFC 2538 - Storing Certificates in the Domain Name System (DNS)
RFC 3110 - RSA/SHA-1 SIGs and RSA KEYs in the Domain Name System (DNS)
PDF Reference: Adobe portable document format version 1.4
(ISBN 0-201-75839-3)
http://partners.adobe.com/asn/developer/acrosdk/docs/filefmtspecs/PDFReference.pdf
Ghostscript - freely available Postscript interpreter that also reads and writes PDF and thus can be used to convert PS to PDF
Multivalent (see below) includes a Java PDF viewer
html2ps - a largely illegible Perl script written by Jan Karrman to convert HTML to Postscript. Yes, it can be done.
Bill Venners' excellent Under the Hood series for JavaWorld
is a better starting point than the spec for understanding JVM.
He also has written a book - Inside the Java Virtual Machine
(McGraw-Hill; ISBN 0-07-913248-0)
http://www.javaworld.com/columns/jw-hood-index.shtml
Java 2 language reference
Java languages page
http://grunge.cs.tu-berlin.de/~tolk/vmlanguages.html
Criticism of Java
http://www.jwz.org/doc/java.html
Parrot - Perl 6 runtime
http://www.parrotcode.com/
Microsoft's .NET architecture includes the Common Language Infrastructure,
based around a virtual machine, now adopted as ECMA-335
http://msdn.microsoft.com/net/ecma/
http://www.ecma.ch/ecma1/STAND/ecma-335.htm
Multivalent, an open-source web browser written totally in Java,
with an extension API to add "behaviors" similar to applets
http://www.cs.berkeley.edu/~phelps/Multivalent/
NetBeans, an attempt to develop a "fully functional Java browser"
http://netbrowser.netbeans.net/
Jazilla, a now defunct attempt to carry the "Javagator" project forward
under an open source banner
http://jazilla.sourceforge.net/
Java Servlets - server-side Java API (CGI-inspired; heavily HTTP-based)
The Java servlet specification includes a chapter specifying the WAR
(Web Application Archive) file format, an extension of ZIP/JAR
http://java.sun.com/products/servlet/
RFC 2186 - Internet Cache Protocol (ICP), version 2
RFC 2187 - Application of ICP
Squid software
http://www.squid-cache.org/
NLANR web caching project
http://www.ircache.net/
Various collections of resources for web caching
http://www.web-cache.com/
http://www.web-caching.com/
http://www.caching.com/
IETF Web Intermediaries working group (webi)
http://www.ietf.org/html.charters/OLD/web-charter.html
IETF Web Replication and Caching working group (wrec)
http://www.wrec.org/
RFC 3143 - Known HTTP Proxy/Caching problems
Cache Array Routing Protocol (CARP) - used by Squid
http://www.microsoft.com/Proxy/Guide/carpspec.asp
http://www.microsoft.com/proxy/documents/CarpWP.exe
RFC 2756 - Hypertext Caching Protocol (HTCP) - used by Squid
Napster's protocol lives on, even if the service is dead. It's basically
a centralized directory with distributed data
http://opennap.sourceforge.net/
http://opennap.sourceforge.net/napster.txt
Gnutella has emerged as the leading post-Napster protocol, employing
both a distributed directory and distributed data
http://www.gnutella.com/
http://www.gnutelladev.com/
http://www.darkridge.com/~jpr5/doc/gnutella.html
Several popular clients use the Gnutella network and protocol
http://www.morpheus-os.com/
http://www.limewire.org/
http://www.winmx.com/
Other proprietary peer-to-peer systems
http://www.kazaa.com/
Other free peer-to-peer systems
http://www.freenetproject.org/