The Orukter Engine code contains all the classes required to provide clustering and filtering for the rest of KnownSpace.

All subpackages of {@link org.datamanager.engine} except {@link org.datamanager.engine.filter} contain implementation details: they may be used, but a KnownSpace developer using the Engine does not need to know about them.

Clustering is performed by the clustering simpleton, using the algorithm described below. It is applied only to documents that have not yet been filtered.

  1. Perform LSI periodically and update the coordinates of each document.
  2. Find the clusters, i.e. the sets of documents, to which the new document is "close enough" (as defined by a similarity threshold between 0 and 1).
  3. Add the document to each of those clusters.
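
The overlapping assignment in steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (ThresholdClustering, SketchCluster, assign) are hypothetical and are not the Engine's API; similarity is assumed to be the cosine between the document coordinates and the cluster centroid; and a document that matches no cluster is assumed to start a new single-document cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual EngineCluster or EngineManager code. */
final class ThresholdClustering {

    /** A cluster holding the coordinate vectors of its member documents. */
    static final class SketchCluster {
        final List<double[]> members = new ArrayList<>();

        /** Mean of the member vectors (assumed cluster representative). */
        double[] centroid() {
            int dims = members.get(0).length;
            double[] mean = new double[dims];
            for (double[] m : members) {
                for (int i = 0; i < dims; i++) {
                    mean[i] += m[i] / members.size();
                }
            }
            return mean;
        }
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /**
     * Adds the document to every cluster whose centroid is at least
     * `threshold` similar to it. If no cluster qualifies, a new cluster
     * containing only this document is created (an assumption).
     * Returns the number of clusters the document now belongs to.
     */
    static int assign(double[] doc, List<SketchCluster> clusters, double threshold) {
        // Collect matches first so that adding the document does not
        // change any centroid while we are still comparing.
        List<SketchCluster> matches = new ArrayList<>();
        for (SketchCluster c : clusters) {
            if (cosine(doc, c.centroid()) >= threshold) {
                matches.add(c);
            }
        }
        if (matches.isEmpty()) {
            SketchCluster fresh = new SketchCluster();
            clusters.add(fresh);
            matches.add(fresh);
        }
        for (SketchCluster c : matches) {
            c.members.add(doc);
        }
        return matches.size();
    }
}
```

Because a document is added to every cluster that passes the threshold, clusters may overlap; lowering the threshold makes overlap more likely.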

When a new document arrives, we take the following steps:

  1. Filter the document content to remove HTML tags, stop words, etc., producing a {@link org.datamanager.engine.filter.WeightedWordList} (see {@link org.datamanager.engine.cluster.EngineDocument}).
  2. Use the LSI matrix to find the coordinates of the new document(s).
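
As a rough illustration of the filtering step, the sketch below strips HTML tags, lowercases the text, drops stop words, and counts the remaining words. It is not the real WeightedWordList: the class FilterSketch, the method weightedWords, the tiny stop-word list, and the use of raw occurrence counts as weights are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of document filtering; not the Engine's filter package. */
final class FilterSketch {

    /** Tiny illustrative stop-word list; a real list would be much longer. */
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "is", "in");

    /** Maps each remaining word to its occurrence count (assumed weight). */
    static Map<String, Integer> weightedWords(String content) {
        // Replace anything that looks like an HTML tag with a space.
        String text = content.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> weights = new LinkedHashMap<>();
        // Split on runs of characters that are not ASCII letters or digits.
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            weights.merge(token, 1, Integer::sum);
        }
        return weights;
    }
}
```

For example, FilterSketch.weightedWords("<p>The cat and the <b>Cat</b> sat</p>") keeps only "cat" (counted twice) and "sat" (counted once).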

The following objects exist in clustering:

(a) {@link org.datamanager.engine.cluster.EngineDocument}: the document content string coupled with its WeightedWordList and a pointer to the original Entity.
(b) {@link org.datamanager.engine.cluster.EngineCluster}: keeps the list of EngineDocuments belonging to the cluster; it also has a name and a pointer back to the Engine. The EngineCluster exposes a public API, defined in {@link org.datamanager.engine.Cluster}, so that user interface developers can manipulate it.
(c) {@link org.datamanager.engine.cluster.EngineManager}: the main class that manages everything.
(d) {@link org.datamanager.engine.cluster.Coordinates}: calculates the coordinates of a document.
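
How Coordinates computes its result is not described here. One common way to obtain LSI coordinates for a new document is the standard "fold-in" projection: given the truncated decomposition with left singular vectors U and singular values sigma, the coordinates are sigma^-1 * U^T * d, where d is the document's term-weight vector. The sketch below assumes that approach; FoldInSketch and its parameter names are hypothetical.

```java
/** Hypothetical sketch of LSI fold-in; not the actual Coordinates class. */
final class FoldInSketch {

    /**
     * Projects a document into the k-dimensional LSI space.
     *
     * @param termWeights weight of each term in the document (one per term)
     * @param u           left singular vectors, indexed as u[term][dimension]
     * @param sigma       the k singular values, all assumed non-zero
     * @return the k coordinates of the document
     */
    static double[] coordinates(double[] termWeights, double[][] u, double[] sigma) {
        int k = sigma.length;
        double[] coords = new double[k];
        for (int dim = 0; dim < k; dim++) {
            double projection = 0;
            for (int term = 0; term < termWeights.length; term++) {
                projection += u[term][dim] * termWeights[term];
            }
            // Scale by the inverse singular value for this dimension.
            coords[dim] = projection / sigma[dim];
        }
        return coords;
    }
}
```

Folding in avoids recomputing the decomposition for every new document, which fits the description of LSI being performed only periodically; the trade-off is that folded-in coordinates drift from what a full recomputation would give until the next LSI run.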

The Engine keeps a list of clusters and a list of documents. All documents are clustered (a cluster may contain only one document).

There is one simpleton (EngineClusteringSimpleton):

This simpleton goes around the pool searching for document entities. For each document entity it finds, it creates an EngineDocument, wraps it in an entity, and attaches it to the original document entity. (Note that the EngineDocument is created by applying filtering to the document.)

Periodically it performs the clustering described above.

These algorithms (especially incremental and overlapping clustering) should be researched and refined by future engineers.


Aleksander Slominski
and
Chao Mwachofi

Last modified: Mon Dec 13 14:29:53 EST 1999