Data Management Problems Today

Users are given little help managing and visualizing large data sets. For example, a user surfs to a page (or creates a page, is emailed a page, reads a news page, or ftps a page) and now wants to ensure that it can be found again. Here's what the user has to do today:

Issue a command to save the page.
Place the page.
Name the page.

The user gets very little help with any of these, and besides, all three operations are questionable. Further, after saving pages, the user still needs to:

Find pages.
Categorize pages.
Navigate through the set of pages.

And today, the user gets very little help to do that either.

Today's operating systems were designed in an era when managing a few hundred pages was all that was required. So today's users are given the hardware to store millions of pages and the software to manage only a few hundred. All that excess work has been put on the user's shoulders.

Saving Pages

Why should the user have to explicitly issue a save command? If the page was interesting enough for the user to look at, it should be saved automatically. A copy of it already exists on the user's hard drive anyway, so it's only a question of copying it out of whichever cache it's in to some more permanent place.

The space needed to copy the page isn't much of an issue now that a gigabyte costs less than $12 (in September 1999). Text pages average about 4 kilobytes and image pages about 80 kilobytes, so $900 can buy fast access to about 20 million text pages or about 1 million image pages.
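
The arithmetic behind those page counts can be checked with a short sketch. It takes the $12 price as an upper bound and assumes decimal units (one gigabyte is a million kilobytes); the function name is only illustrative.

```python
# Figures from the text (September 1999 prices).
DOLLARS_PER_GB = 12      # upper bound: "less than $12"
BUDGET_DOLLARS = 900
TEXT_PAGE_KB = 4
IMAGE_PAGE_KB = 80
KB_PER_GB = 1_000_000    # assumption: decimal units


def pages_for_budget(budget_dollars, page_kb):
    """Number of pages of the given size that the budget can store."""
    gigabytes = budget_dollars / DOLLARS_PER_GB
    return int(gigabytes * KB_PER_GB / page_kb)


text_pages = pages_for_budget(BUDGET_DOLLARS, TEXT_PAGE_KB)
image_pages = pages_for_budget(BUDGET_DOLLARS, IMAGE_PAGE_KB)
print(text_pages, image_pages)
```

At exactly $12 per gigabyte this gives a little under 19 million text pages and a little under 1 million image pages; any price below $12 pushes both figures up toward the round numbers quoted.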

The time needed to copy the page isn't very important either since disk transfer rates are now around 30 megabytes per second. If the user takes 2 seconds to type a name, making the user manually save a 5 kilobyte text page takes twelve thousand times longer than otherwise.
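
The ratio can be reproduced in a few lines. This sketch uses only the numbers in the paragraph above and assumes decimal units (30 megabytes is 30 million bytes); it ignores seek time and other overheads.

```python
# Figures from the text; decimal units are an assumption.
DISK_BYTES_PER_SECOND = 30_000_000   # about 30 megabytes per second
PAGE_BYTES = 5_000                   # a 5 kilobyte text page
NAMING_SECONDS = 2.0                 # time the user spends typing a name

copy_seconds = PAGE_BYTES / DISK_BYTES_PER_SECOND
slowdown = NAMING_SECONDS / copy_seconds
print(round(slowdown))
```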

Of course, if the page is multimedia, or if it is being fetched over a slow network line, it still pays to keep the user involved, since 90 minutes of high-quality audio takes 1 gigabyte, and 3 minutes of high-quality video can also take 1 gigabyte. That's okay, however: bandwidth is still very low, so there can be only a few such very large pages, and the effort involved is small.

Placing Pages

Why does the user always have to give the system a location for the page? The computer could, at least sometimes, figure out a reasonable place to put the page. If the user is reading news and has saved the last fifty pages to a certain directory then chances are good that that's where the user wants this new one to be saved too. Of course, not all situations are so simple, but the point is that today there is no alternative to having to place the page.
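
One way a system might guess a location in the simple case described above is to look at where recent pages went. The following is only a sketch of that idea: the function name and the 80 percent threshold are assumptions made for illustration, not a description of any existing system.

```python
from collections import Counter


def suggest_directory(recent_dirs, min_share=0.8):
    """Return the directory that dominates recent saves, or None.

    recent_dirs is the list of directories used for the last few saves.
    min_share is a hypothetical threshold: suggest a default only if that
    fraction of the recent saves went to the same place.
    """
    if not recent_dirs:
        return None
    directory, count = Counter(recent_dirs).most_common(1)[0]
    if count / len(recent_dirs) >= min_share:
        return directory
    return None  # no clear pattern, so fall back to asking the user


# The last fifty pages all went to the same directory.
print(suggest_directory(["news/saved"] * 50))
```

When the recent saves are scattered, the function returns None, which corresponds to the harder situations where the user still has to choose.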

Further, today's operating systems often have no defaults for locations to place various kinds of pages.

The Windows operating system, for example, doesn't realize that the user is likely to want to save a page in the same place the user is in currently (in fact, it doesn't even seem to know that users have current locations at all, so it can't even try to predict where to save pages).

The Macintosh operating system has a notion of where users are at present, but only in the weak sense that it remembers where the user last saved a page. It pays no attention to the fact that the page the user is trying to save right now might be completely different from the page the user saved last Tuesday.

The Unix operating system at least has a notion of a user being in a certain place, so whenever a user tries to save a page it defaults to the present directory---or, rather, the directory the program was started in (which may not be the same thing). This can lead to problems, however, when a user starts a mail or news or edit session in a particular directory then hops around. The user has to continually remember that any pages saved aren't saved to the current location.

Naming Pages

Why must the user always name the page? Naming all pages only makes sense in a world where no page is stored unless the user stores it, which means a world of a few hundred pages, but it's unreasonable in a world of tens of thousands of pages.

A page name might be meaningful if the user created the page by hand, but is likely to be less so if it was simply bookmarked or saved from mail or news or via ftp. These different degrees of user involvement should be handled differently; it makes little sense to ask the user to name a page that the user has simply browsed briefly. Even when the user names a page, several months later that particular name is likely to be only a distant memory.

The sole advantage of naming a page versus having it named automatically is that the user is more likely to remember the name later. And that is only necessary because the user's computer gives little aid to finding pages other than through their names.

Finding Pages

To find pages users must remember page names. Worse, they also have to remember trivial details of each name---whether it's upper-case or lower-case, whether it's hyphenated or concatenated or spaced, how it's spelled.

Users can also search for a page by a few other attributes: typically, date, size, type, and content, but these searches are quite primitive (particularly in Unix). Content search today consists of conjunctions of single words. Also, pages are not pre-indexed, nor are search results saved, so each content search is potentially expensive and consequently is done rarely.
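
To make the cost argument concrete, here is a minimal sketch of what pre-indexing could look like: the pages are scanned once to build a word-to-pages table, after which a conjunction of single words is answered by intersecting small sets instead of rereading every page. The function names and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search_all(index, words):
    """Return the pages that contain every one of the given words."""
    matches = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*matches) if matches else set()


# Hypothetical pages, for illustration only.
pages = {
    "memo": "disk prices keep falling",
    "notes": "disk transfer rates keep rising",
    "letter": "see you next tuesday",
}
index = build_index(pages)
print(sorted(search_all(index, ["disk", "keep"])))
```

The index is built once and reused, so repeated searches no longer pay the price of scanning all the content each time.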

Further, simply being able to find one page out of thousands isn't enough. Often, when working on a page, users also want various related pages close to hand. This becomes a problem thanks to the next issue: categorizing related pages.

Categorizing Pages

Today the user must put each new page in some unique place in a single tree. Users can't dynamically reorganize their pages---everything must fit within a single hierarchy, even though most pages really belong in several categories. Consequently, naming a page is synonymous with both placing it and categorizing it.

Today, the name of a page is a mere extension of its category---which is a concatenation of all the directories it's in starting at the root of the tree. Its real name is the same as its category (which is itself the same as its place) concatenated with its supposed "name."
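
The point can be shown in a couple of lines. This sketch uses Unix-style paths through Python's posixpath module; the directory and page names are made up for illustration.

```python
import posixpath


def real_name(directories, name):
    """Build a page's full name from its chain of directories and its 'name'."""
    return posixpath.join("/", *directories, name)


full = real_name(["home", "news", "1999"], "storage-prices")
category = posixpath.dirname(full)   # the place, which is also the category
print(full)
print(category)
```

Moving the page to a different category changes the first part of the string, and so changes the page's real name as well.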

Further, although symbolic linking is allowed in all three major operating systems (aliases on the Macintosh, shortcuts in Windows, and links in Unix), it is awkward in all three and must be done by hand in any case. Lacking support for easy and extensive symbolic linking, most users don't use it at all. And of course there's no automated linking at all.

Categorizing pages today consists solely of putting related pages in a new directory. It's up to the user to make sure that the pages in that directory are indeed related; the computer does nothing to check that this is in fact so. Further, that new directory appears as visually anonymous as any other, and its placement relative to other directories is not at all significant. So no portion of the user's memory, besides the memory of the page's name, is being exploited to help the user find the directory again. And that becomes a problem thanks to the next issue: navigating among pages.

Navigating Among Pages

Navigation today is almost solely name-based. Users can move around their hierarchies only if they can remember the name of a directory to go to. Many users, however, more often remember appearance, placement, content, or context. A world of names forces everything to depend on a very fragile part of human memory.

Real navigation is pretty nearly impossible on the desktop today since there are few landmarks, and no maps. The desktop isn't really a "space" at all. Icons can be added, moved, overlapped, and deleted, but their appearance and placement have no meaning to the computer. The Windows operating system is particularly bad in this respect---if you insert or otherwise alter some part of a directory in a directory, the inner directory moves!

The only intersection between a user's understanding of the space-likeness of the desktop and anything the computer understands and (sometimes) pays attention to is the user's current location within the tree of directories. That is the entire extent of the computer's understanding of space today. It doesn't even keep track of the locations users visit most often, or those that users visit when doing various things.

Further, directories and pages, and the programs that operate on them, do not even know that they are in a tree, let alone that they are on a desktop. All they know is their name---which is the same as their path from the root. Consequently, they aren't even aware of their siblings in the tree. So from a page's point of view, it's not even inside a tree---it's at the end of a linear list.

Since all pages are placed in a tree and all naming, placing, accessing, and categorizing is based on that tree, today's desktop only appears to be a two-dimensional space, but is in fact a one-dimensional space. This presents serious interface problems:

Today's machines cannot deduce anything about the properties of pages from the placement (or rearrangement) of their icons. For example, dragging an icon all the way over the desktop to a corner containing a garbage can icon, with no other icon anywhere nearby, does not tell the computer that the user means to put the icon in the garbage, so if the user is off by even a few pixels the dragged icon is simply left on the desktop rather than being put in the trash.
Today's machines do not let users place page icons together yet still be able to see them. The only way to cluster pages in any sense that the machine understands today is to put them in a directory, and that occludes the pages contained in the directory.
Today's machines cannot be told that certain pages belong together simply by moving their icons closer together. Putting the most frequently used icons near the top and to the right of the screen, for example, is not a noticeable fact to these computers. To do that categorization today users must place icons in a single directory---and, thanks to the rigidity of the hierarchy and the difficulty of linking, they've then lost the opportunity to rearrange them easily. Users can't easily say things like: "these ten things all belong together, but these three within that set of ten also belong with these other fifteen, which all belong together."
Today's machines cannot capture any information about a user's context that's related to the spatial arrangement of icons the user is selecting at present or had selected in the past, so they cannot help the user do spatially-related tasks or do those tasks themselves. For example, the machine cannot help the user re-place an icon in case of accidental erasure or rearrangement. There's no way to say "find the icon that used to be here."
Today's machines cannot link space and time---they cannot tell that a user is clicking on certain icons in a certain geometric sequence because they have no notion of geometry. The machine can't tell, for example, that a user is clicking a particular row of icons inside a directory since the placement of those icons isn't noticeable---as far as the machine is concerned they are all in the same place since they're all in the same directory.
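
The grouping that users can't easily express ("these ten things all belong together, but these three within that set of ten also belong with these other fifteen") is simple to state once a page may belong to more than one category. The sketch below uses plain sets; the group and page names are hypothetical.

```python
# Ten pages that belong together.
group_a = {f"page{i}" for i in range(10)}

# Three of those ten also belong with fifteen other pages.
shared = {"page0", "page1", "page2"}
group_b = shared | {f"other{i}" for i in range(15)}

groups = {"A": group_a, "B": group_b}


def categories_of(page, groups):
    """List every group a page belongs to, in alphabetical order."""
    return sorted(name for name, members in groups.items() if page in members)


print(categories_of("page0", groups))   # a page in both groups
print(categories_of("page9", groups))   # a page in only one group
```

A single tree cannot hold this arrangement without links, because each page there has exactly one parent directory.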

Conclusion

All of the above operating system choices made sense fifteen years ago when the basic assumptions behind today's desktops were being decided. Computers back then had virtually no cycles and no memory compared to today's machines. Being stressed just to do simple computations, they lacked the resources to do all but the barest minimum information management. Today that is no longer true. Most computers today are idle the vast majority of the time---even when we use them.

Further, when today's operating systems were being designed there was no such thing as the web. Since then there's been a communications revolution. Nowadays, communicating information across computers has become straightforward. We can now access over half a billion webpages, with more coming all the time, but we have only the same fifteen-year-old tools to manage all that information.

Finally, storage prices have plummeted and gigabytes of data are now on everyone's desk. Soon, even terabytes will be cheap. Massively increased cycles, increasingly easy information communication, and an ever-increasing ability to store massive amounts of information lead to the obvious conclusion that it's time to put those empty cycles to use and relieve ourselves of some of the horrendous burden of information management.