November 5, 2009

Web Architecture and URIs: Recommendation versus Reality

Posted in Tech

The goals of the World Wide Web Consortium (W3C), according to its website w3c.org, include “[developing] interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential.” With any set of guidelines or “good practices” there is often a gap between idealistic theory and reality. In exploring some recommendations of the W3C publication Architecture of the World Wide Web, Volume One (2004), this essay argues for the use of persistent and semantic Uniform Resource Identifiers (URIs), proposes ways to design such identifiers, and examines systems that both obey and disobey such conventions.

Architecture of the World Wide Web begins by outlining the fundamental paradigm of the information space we know as the web. The document describes the notion of a resource (“items of interest”), identified by a global identifier (a URI; side note: URL is sometimes used interchangeably in this discussion, since many of the URIs discussed are indeed locators as well). When an agent dereferences a URI, a server returns a representation of the resource. Unfortunately, this paradigm is lost on many people, who perceive the web as little more than a glorified file system. While the W3C recommendation states the good practice that “A URI owner SHOULD provide representations of the identified resource consistently and predictably”, anyone who has encountered a “404 Not Found” message knows the problem of dead links. In Cool URIs don’t change (1998), World Wide Web creator Tim Berners-Lee reflects that “There are no reasons at all in theory for people to change URIs [...] but millions of reasons in practice.” He deconstructs common reasons people assert for allowing URIs to die, such as website reorganization, change of technology, or the lack of proper tools for URI management. For example, noting that web technologies and file formats come and go, Berners-Lee recommends leaving such elements out of URIs in order to future-proof them.

While both the architecture document and Cool URIs outline content negotiation and HTTP response codes as ways to allow URIs to “live forever,” these techniques are relatively costly compared to easy content publishing (and removal!) methods like FTP. While it seems unreasonable to require every individual to maintain persistent URLs (indeed, the architecture provides no way to enforce such a lofty goal), any individual who is serious about his web presence should strive for them. The many benefits of persistent URIs include the ability to attract links – “links from other websites are the third-most common way people find sites” (Nielsen, 1999) – and the related avoidance of linkrot. Dead links to and on a website undermine credibility, and as Nielsen notes, “Linkrot equals lost business.”
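To make the response-code idea a little more concrete, here is a minimal Python sketch of how a site might keep old URIs alive after a reorganization by answering with a permanent redirect instead of a 404. The paths, the MOVED and LIVE tables, and the resolve function are all made up for illustration; a real site would more likely configure this in its web server or CMS.

```python
from http import HTTPStatus

# Hypothetical table of retired paths and where their resources live now.
MOVED = {
    "/staff/john.html": "/employees/john",
    "/cgi-bin/news.pl": "/news",
}

# Hypothetical set of paths that are currently served directly.
LIVE = {"/employees/john", "/news"}


def resolve(path):
    """Return (status, location) for a requested path."""
    if path in LIVE:
        return HTTPStatus.OK, path
    if path in MOVED:
        # The old URI still "works": the agent is sent to the new location.
        return HTTPStatus.MOVED_PERMANENTLY, MOVED[path]
    return HTTPStatus.NOT_FOUND, None
```

With a table like this, a link to the old /staff/john.html keeps working after the file is gone, and only paths that were never published fall through to “404 Not Found.”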

Once the importance of URI persistence is established, it makes sense to spend some time designing identifiers that will remain in place indefinitely. Among the many properties of URIs, Architecture of the WWW argues for URI opacity: “Agents making use of URIs SHOULD NOT attempt to infer properties of the referenced resource.” Does this mean that URIs should be pseudo-random text or the like? Certainly not! Usability guru Jakob Nielsen notes that even though URIs are “a machine level addressing scheme[,] users often go to websites or individual pages through mechanisms that involve exposure to raw URLs.” In the area of search, for example, the value of human-readable, semantic URIs is a convincing argument against opaque URIs. Such URIs can be a great asset in driving traffic to a website through search results, as a URI is one of three short pieces of information (along with page title and description) displayed in the results of major search engines. Nielsen cites a Microsoft Research study that found “people spend 24% of their gaze time looking at the URLs in the search results” and says that these results are consistent with his own research, which concluded that “searchers are particularly interested in the URL when they are assessing the credibility of a destination.” In fact, search engines such as Google and Yahoo! will bold a search term when it appears in a URL on a search results page, thus making it all the more helpful to include relevant terms in URIs. Semantic URIs are also important for inbound links from other websites. In the absence of a title tag from the linker, a quick glance at the outbound URI in the browser status bar may be an important factor in click-through rate.

With the rise of the browser search box and the relative effectiveness of modern search engines, remembering URIs is becoming less important; however, their design remains an important consideration in usability. In URL as UI, Nielsen promotes URIs that allow users to visualize site structure as well as “hackable” URIs that “allow users to move to higher levels of information architecture by hacking off the end of the URL.” For instance, a URI like http://example.org/employees/john is “hackable” in the sense that by removing /john the user might expect to get a list of employees. Such URIs cater to more advanced users and support the navigation objective of the web.
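As a small illustration of what “hacking off the end of the URL” means, the following Python sketch lists the URIs a user would reach by removing one trailing path segment at a time. The function name parent_urls is my own, and whether each parent actually returns something useful is up to the site.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """List the URLs obtained by removing trailing path segments one by one."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    while segments:
        segments.pop()  # hack off the last segment
        path = "/" + "/".join(segments)
        # Query string and fragment are dropped from the parent URLs.
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents
```

For http://example.org/employees/john this yields http://example.org/employees and then http://example.org/. A site with hackable URIs would serve a sensible representation at each of those levels.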

After reading Fast webpage classification using URL features, one might wonder if the authors have ever read the W3C’s recommendation on URI opacity. Citing speed as the primary benefit, Kan and Thi (2005) analyzed URLs by segmenting them into “meaningful tokens.” Their analysis concluded that “URL features correlate with Pagerank [...] allowing prediction of Pagerank within 1 point on average on Google’s 10-point scale.” This study is especially interesting in light of Berners-Lee’s advice in Cool URIs on what to leave out when creating persistent URIs. He advocates omitting elements like the author’s name (“authorship can change with new versions”) and subject (to avoid “binding yourself to some classification”), which might be useful in an automated classification of pages by URL. While Berners-Lee’s advice might hurt the prospect of classification by URL analysis, Nielsen’s notion of “hackable” URLs would seem to suggest favorable results for URL analysis.
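To give a feel for what segmenting a URL into tokens might look like, here is a deliberately naive Python sketch that splits the host and path on anything that is not a letter or digit. This is my own simplification, not the segmentation method Kan and Thi describe, and the function name url_tokens is hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = parts.netloc + " " + parts.path
    # Any run of non-alphanumeric characters acts as a separator.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
```

Even this crude split shows why semantic URIs help automated analysis: http://example.org/employees/john produces the tokens example, org, employees, and john, while a pseudo-random identifier would produce nothing a classifier could use.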

A nice solution that conforms to Berners-Lee’s naming suggestions is the permalink feature in content management systems like WordPress. This option allows a “permanent” semantic identifier of the form http://example.com/2008/03/07/sample-post/. The identifier includes the publication date and the title of the post, two pieces of metadata that are almost certain not to change. WordPress allows posts to be placed in categories, but omitting the category from the post URI allows the post to be recategorized over time as site structure evolves (a la Cool URIs). Bravo WordPress! On the other hand, URI analysis would likely be much easier with the category tag. Indeed, there seems to be a tension between “Cool URIs” and those that can be easily categorized by analysis like that of Kan and Thi.
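The date-plus-title pattern is easy to reproduce outside of WordPress. The Python sketch below builds an identifier of the same shape from a publication date and a post title. The slugify and permalink functions are my own illustration and make no attempt to match WordPress’s actual slug rules.

```python
import re
from datetime import date


def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def permalink(base, published, title):
    """Build a date-and-title URI; no category or file extension is included."""
    return f"{base}/{published:%Y/%m/%d}/{slugify(title)}/"
```

Calling permalink("http://example.com", date(2008, 3, 7), "Sample Post") gives http://example.com/2008/03/07/sample-post/. Note that nothing in the result depends on the category, the author, or the software that produced the page.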

On the topic of URI aliases, the Architecture of the World Wide Web states unequivocally: “A URI owner SHOULD NOT associate arbitrarily different URIs with the same resource.” Interestingly, video sharing site Viddler.com seems to do exactly that. Viddler appends /[frame number]/ to a video URI to initiate playback (within a Flash player) at that particular frame, thereby associating with each video as many URIs as the video has frames. According to the web architecture, the preferred way to implement such a system would be to use the fragment identifier. This optional text component following the # sign in a URI “allows indirect identification of a secondary resource”. In this case, the dilemma occurs because the fragment identifier remains client-side and is not passed to the server. The trouble with associating arbitrarily many URIs with the same resource is that it eliminates the advantage of global identifiers and the corresponding network effects as described in the W3C architecture document. Search engines that use links as a factor in ranking search results, for example, will consider each frame URI as a separate resource. Unfortunately, the convenience of embedding a Flash video comes at the expense of associating arbitrarily many URIs with a single resource.
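The difference between the two approaches can be seen with Python’s standard urllib.parse.urldefrag, which separates a URI from its fragment. The example URIs and the #frame=120 syntax below are hypothetical, not Viddler’s actual scheme.

```python
from urllib.parse import urldefrag


def split_fragment(uri):
    """Return (uri_without_fragment, fragment) for a URI."""
    base, fragment = urldefrag(uri)
    return base, fragment


def same_primary_resource(a, b):
    """True when two URIs differ at most in their fragment identifiers."""
    return urldefrag(a).url == urldefrag(b).url
```

With fragments, http://example.com/videos/42#frame=120 and http://example.com/videos/42#frame=480 both reduce to the single URI http://example.com/videos/42, so links to any frame count toward one resource. With path-based frame numbers such as /videos/42/120/ and /videos/42/480/, the two URIs stay distinct.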

Sometimes poor URI design can result in other users associating arbitrarily many URIs with your resources. Long, non-semantic URLs are good candidates for applications like tinyurl.com, which is designed to convert a long hyperlink into a shorter link of the form [tinyurl.com/xxxxx]. While these tiny URLs are handy for passing around in applications like email or Twitter, all context (domain name, etc.) provided by the URI is lost, which may cause users to be wary of following such a link. Thnlnk.com goes a step further by asking the user to enter a 5-7 word description of the resource and then attempting to generate a more semantic URL based on the description. In my experience, however, these URLs were sometimes as long as the original. Better to design your own URIs correctly in the first place and avoid these context killers!

The architecture of the web is a broad framework that does little to specify aspects of usability and semantics, let alone enforce such aspects. Ideas such as “usable URIs” and the “semantic web” lie outside of the architecture and are left up to people to decide and act upon. Ultimately, the onus lies on website administrators and content publishers to create usable URIs and to associate documents with identifiers rather than the reverse. The significant benefits of doing so, coupled with the costs of not doing so, however, should not be ignored by anyone who takes his web presence seriously.

References:
1. Berners-Lee, Tim. Cool URIs don’t change. 1998.
2. Jacobs, Ian and Norman Walsh. Architecture of the World Wide Web, Volume One. December 15, 2004.
3. Kan, M. and Thi, H. O. 2005. Fast webpage classification using URL features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, 325-326.
4. Nielsen, Jakob. URL as UI. March 21, 1999.
