Visual vs logical (semantic) formatting

Readings

pp. 13-22

The semantic web -- or 'help the robots'

Tim Berners-Lee, who first came up with HTML, had this to say in 1998 about his vision of the 'semantic web':

"The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. One of the major obstacles to this has been the fact that most information on the Web is designed for human consumption, and [...] that the structure of the data is not evident to a robot browsing the web. [...T]he Semantic Web approach instead develops languages for expressing information in a machine processable form.

[The goal of the Semantic Web approach is] to take us, step by step, from the Web of today to a Web in which machine reasoning will be ubiquitous and devastatingly powerful."

Here are three examples of how this is already happening.

Finding content via a search engine website

The machine: Googlebot -- Google's software to scan web pages and rank them according to search relevance.

How the machine helps: Users of google.com can find sites related to their interests by typing in terms to search for.

Mennonite Historical Library

Coding/language that makes this possible:

<title>Home - Mennonite Historical Library</title>
...
<h1>The Mennonite Historical Library</h1>
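To get a feel for what a robot "sees" in that markup, here is a minimal Python sketch using the standard library's html.parser. It is not how Googlebot actually works; the class name, the describe function and the banner page (including its banner.gif file name) are made up for illustration.

```python
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text found inside <title> and <h1> elements."""

    def __init__(self):
        super().__init__()
        self.current = None                    # tag we are inside, if any
        self.found = {"title": [], "h1": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.found:
            self.current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the last entry.
        if self.current:
            self.found[self.current][-1] += data


def describe(html):
    """Return the title and h1 text a simple robot could read from a page."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    parser.close()
    return parser.found


text_page = """<html><head><title>Home - Mennonite Historical Library</title></head>
<body><h1>The Mennonite Historical Library</h1></body></html>"""

# Hypothetical page whose heading exists only as a picture of the words.
banner_page = """<html><head><title>Home</title></head>
<body><img src="banner.gif"></body></html>"""

print(describe(text_page))
print(describe(banner_page))
```

For the second page the robot finds no heading text at all: the words exist only as pixels inside the image.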

In contrast, Googlebot wouldn't know how to classify a page that showed the same words only as a graphical banner (a picture of the text), even though it's perfectly clear to a human with working eyes what it's all about.

Adding events to your personal calendar

The machine: Operator Firefox Addon.

How the machine helps: When you see event-related information on a web page, you can add it to your on-line calendar with a few clicks, without having to re-type.

Operator screenshot

Coding/language that makes this possible:

<div class="vevent">
<span class="summary"><a href="http://gconline.goshen.edu/...">Yoder Public Affairs Lecture: "Iran and the U.S.: Prospects for a Better Future" - Dr. Trita Parsi</a></span>
<abbr class="dtstart" title="20081105T193000-0500">7:30 pm, </abbr> <span class="location">Rieth Recital Hall</span>
</div>

When marking up a web page, you can choose any names you want for CSS classes and IDs. But here the names chosen are the ones agreed on for the hCalendar "microformat".
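Because the class names are agreed on, a program can pick the event details out of the page. The following is a rough Python sketch of that idea, not the way the Operator add-on is implemented. The sample markup is a shortened copy of the event above with a placeholder link, and EventParser and parse_event are names invented here. It assumes well-nested markup with no unclosed tags such as <br>.

```python
from datetime import datetime
from html.parser import HTMLParser

FIELDS = ("summary", "dtstart", "location")

# Shortened sample; href="#" is a placeholder. hCalendar's name for the
# wrapper class is "vevent"; this sketch does not inspect the wrapper.
SAMPLE = """<div class="vevent">
<span class="summary"><a href="#">Yoder Public Affairs Lecture</a></span>
<abbr class="dtstart" title="20081105T193000-0500">7:30 pm, </abbr>
<span class="location">Rieth Recital Hall</span>
</div>"""


class EventParser(HTMLParser):
    """Collects summary, dtstart and location from well-nested markup."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self.open_fields = []   # one entry per open tag: field name or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        field = next((c for c in classes if c in FIELDS), None)
        self.open_fields.append(field)
        if field == "dtstart":
            # The machine-readable date lives in the title attribute.
            self.event["dtstart"] = attrs.get("title") or ""
        elif field:
            self.event.setdefault(field, "")

    def handle_endtag(self, tag):
        if self.open_fields:
            self.open_fields.pop()

    def handle_data(self, data):
        # Text belongs to the nearest enclosing summary/location element.
        for field in reversed(self.open_fields):
            if field in ("summary", "location"):
                self.event[field] += data
                break


def parse_event(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    event = {name: value.strip() for name, value in parser.event.items()}
    # strptime's %z accepts offsets written like -0500.
    event["start"] = datetime.strptime(event["dtstart"], "%Y%m%dT%H%M%S%z")
    return event


event = parse_event(SAMPLE)
print(event["summary"], "|", event["location"], "|", event["start"].isoformat())
```

Notice that the program never reads "7:30 pm": the human-friendly text is for people, and the title attribute of the abbr tag carries the version meant for machines.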

News 'feeds'

The machine: Google Reader (but could be any number of other RSS feed-readers).

How the machine helps: Displays a list of only the latest additions to a website.

Google Reader screenshot

Coding/language that makes this possible:

Get to the feed by following the RSS feed icon on the China SST website. Look at the feed's source code, which looks like this:

<item>
    <title>SSTers in Langzhong</title>
    <link>http://www.goshen.edu/b/l/sst-china08/2008...</link>
    <description>...On October 30 we spent the day with the SSTers...</description>
    <pubDate>31 Oct 2008 04:07 EST</pubDate>
</item>

This is XML. (Some of you may have heard this called 'RSS' - Really Simple Syndication. Yes, RSS is one kind of XML.)
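Since a feed is plain XML, a few lines of code are enough to list its items. Here is a minimal Python sketch of what a feed reader does at its core, using the standard library's xml.etree.ElementTree. The feed text is a made-up stand-in: the example.com links, the channel title and the second item are placeholders, not content from the real China SST feed.

```python
import xml.etree.ElementTree as ET

# Stand-in feed. Only the first item's title and date come from the
# example above; everything else here is a placeholder.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>SSTers in Langzhong</title>
      <link>http://example.com/newer-post</link>
      <pubDate>31 Oct 2008 04:07 EST</pubDate>
    </item>
    <item>
      <title>An earlier post</title>
      <link>http://example.com/older-post</link>
      <pubDate>24 Oct 2008 09:15 EST</pubDate>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return one dict per <item>, mapping each child tag to its text."""
    root = ET.fromstring(feed_xml)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]


for entry in list_items(FEED):
    print(entry["pubDate"], "-", entry["title"], "-", entry["link"])
```

The program does not have to guess which piece is the title and which is the date, because the tag names say so.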

From HTML to XML and back to XHTML

Early HTML

The very first versions of HTML made some distinctions between visual markup and logical markup. For example:

<i>..</i> makes text italic, but
<em>..</em> marks text as emphasized, and leaves it up to the browser whether to show the emphasis as italic or bold.

But on the whole, it was mostly a language for visually marking up text. The epitome of this was the <font color="red">...</font> tag.

HTML syntax is quite forgiving:

Case of tags and attributes is not important
It doesn't matter if attributes are in quotes or not
Tags don't absolutely have to be closed, e.g. <p>blah.<p>blah is just fine.

This works fine for straight web pages. But now that you've got that big ol' archive of press releases in HTML, it sure would be nice to take those documents and make them available on the much smaller screens of a PDA, or pump them onto an e-mail listserv. This is the kind of task that you'd like to automate and have a computer program do, but "where's the headline?" Is it always the first <h1> on a page? Or is it sometimes an <h2>, or...? The computer program may be frustrated by the lack of structure.

It would be much easier to automatically figure out what pieces of a page "meant something" if the page was coded in...

XML - eXtensible Markup Language

An XML document contains no visual markup whatsoever. All the information is marked up according to its information function. For example:

<headline>Canadian Brass coming to Goshen College</headline>
<datemonth>November</datemonth>
<dateday>17</dateday>

Then, a different stylesheet (using "XSL") can be used to format the document for the web, or for a mobile phone screen, or for e-mail, or whatever.

Furthermore, for such a document to be checked automatically (validated), it has to obey a rather strict set of rules set forth in a "Document Type Definition" (DTD), and the document has to indicate the location (URL) of the DTD.
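The idea of "one document, several presentations" can be sketched in a few lines of Python. This is a stand-in for what an XSL stylesheet would do, not XSL itself. The <pressrelease> wrapper element and the function names are assumptions made for this example; the three inner elements are the ones shown above.

```python
import xml.etree.ElementTree as ET
from html import escape

# The <pressrelease> root element is an assumption: an XML document
# needs exactly one outermost element.
DOC = """<pressrelease>
  <headline>Canadian Brass coming to Goshen College</headline>
  <datemonth>November</datemonth>
  <dateday>17</dateday>
</pressrelease>"""


def read_release(xml_text):
    """Pull the pieces out by name; no guessing about <h1> versus <h2>."""
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext("headline"),
        "date": f'{root.findtext("datemonth")} {root.findtext("dateday")}',
    }


def as_web_page(release):
    """One possible presentation: an HTML fragment."""
    return (f'<h1>{escape(release["headline"])}</h1>\n'
            f'<p class="date">{escape(release["date"])}</p>')


def as_email(release):
    """Another presentation of the same content: plain text for e-mail."""
    return f'Subject: {release["headline"]}\n\nDate: {release["date"]}'


release = read_release(DOC)
print(as_web_page(release))
print(as_email(release))
```

The content is written once, and each output format gets its own small formatting function.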

This is worth it if you've got a huge collection of documents, but not if you've got 3 or 4 or 1.

XHTML

XHTML is somewhere in between. It is, formally speaking, an XML dialect, and so must have a DTD declaration, and its syntax rules must be obeyed. Syntax rules include:

tags and attribute names must be lower case,
all attribute values must be in quotes,
all tags that can be closed must be closed,
tags that don't have a closing tag must use the <br /> or <img /> syntax with a closing slash inside the tag.
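Most of these rules come from XML itself, so a strict XML parser will refuse markup that a browser would happily display. The Python sketch below feeds a few made-up snippets to the standard library's XML parser. Note the limits: it only checks that the markup is well-formed XML. It does not check the document against the XHTML DTD, so it would not catch upper-case tags that are consistently opened and closed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Made-up snippets: three in "forgiving" HTML style, one in XHTML style.
samples = {
    "unclosed <p>": "<div><p>blah.<p>blah</div>",
    "unquoted attribute": "<div><p class=intro>blah.</p></div>",
    "<br> without slash": "<div>one<br>two</div>",
    "XHTML style": '<div><p class="intro">one<br />two</p></div>',
}

for label, markup in samples.items():
    verdict = "accepted" if is_well_formed(markup) else "rejected"
    print(f"{label}: {verdict}")
```

Only the last snippet is accepted, which is exactly the strictness that makes XHTML pages easier for programs to process.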

Increasingly, organizations are hedging their bets for the future by requiring outside web design companies to code their websites to XHTML standards, so you need to know about it.

Where does CSS fit in?

Notice that by using CSS classes you are actually moving in the direction of marking up your web pages according to function rather than (only) appearance. We also call this separation of design and content.