The Orukter Engine code contains all the classes required to provide
clustering and filtering for the rest of KnownSpace.
All subpackages of {@link org.datamanager.engine} except {@link org.datamanager.engine.filter}
contain implementation details: they may be used, but a KnownSpace developer using the Engine
does not need to know about them.
Clustering is carried out by the clustering simpleton, using the algorithm
described below.
It is applied only to documents that have not yet been filtered.
When a new document arrives, we take the following steps:
- We filter the document content to remove HTML tags, stop words, etc.
This filtering produces a {@link org.datamanager.engine.filter.WeightedWordList}
(see {@link org.datamanager.engine.cluster.EngineDocument}).
- We use the LSI (latent semantic indexing) matrix to find the coordinates of the new document(s).
- We find the clusters, i.e. the sets of documents, that the new document is "close enough" to (as defined by a similarity threshold between 0 and 1).
- We add the document to each of those clusters.
- Periodically, we perform LSI again and update the coordinates of every document.
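The "find the close clusters, then add the document to each of them" steps can be sketched as follows. This is only an illustration under stated assumptions, not the Engine's actual code: the class and method names are hypothetical, similarity is assumed to be the cosine between the document's coordinates and a cluster centroid, and a document that is close to no existing cluster is assumed to start a new single-document cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of incremental, overlapping clustering (not the Engine's classes). */
final class OverlappingClusteringSketch {

    /** Assumption: a cluster is just the list of its members' coordinate vectors. */
    static final class SketchCluster {
        final List<double[]> members = new ArrayList<>();

        /** Centroid: component-wise mean of the member vectors. */
        double[] centroid() {
            double[] c = new double[members.get(0).length];
            for (double[] m : members) {
                for (int i = 0; i < c.length; i++) {
                    c[i] += m[i] / members.size();
                }
            }
            return c;
        }
    }

    /** Cosine similarity; defined here as 0 when either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /**
     * Adds doc to every cluster whose centroid similarity is at least threshold,
     * or to a new single-document cluster if there is none.
     * Returns the number of clusters the document ended up in.
     */
    static int addDocument(List<SketchCluster> clusters, double[] doc, double threshold) {
        // Collect all matches first, so centroids do not shift while we compare.
        List<SketchCluster> close = new ArrayList<>();
        for (SketchCluster c : clusters) {
            if (cosine(c.centroid(), doc) >= threshold) {
                close.add(c);
            }
        }
        if (close.isEmpty()) {
            SketchCluster fresh = new SketchCluster();
            clusters.add(fresh);
            close.add(fresh);
        }
        for (SketchCluster c : close) {
            c.members.add(doc);
        }
        return close.size();
    }
}
```

With a threshold of 0.7, adding the vectors (1, 0) and (0, 1) creates two separate clusters, and then adding (1, 1) places it in both of them, which is the overlapping behaviour described above.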
The following objects exist in clustering:
- (a) {@link org.datamanager.engine.cluster.EngineDocument}: the document content string coupled with its WeightedWordList
and a pointer to the original Entity.
- (b) {@link org.datamanager.engine.cluster.EngineCluster}: keeps the list of EngineDocuments belonging to the
cluster; it also has a name and a pointer back to the Engine. The
EngineCluster needs a public API so that user-interface developers
can manipulate it; that API is defined in {@link org.datamanager.engine.Cluster}.
- (c) {@link org.datamanager.engine.cluster.EngineManager}: the main class that manages everything.
- (d) {@link org.datamanager.engine.cluster.Coordinates}: calculates the coordinates of a document.
5. The Engine keeps a list of clusters and a list of documents.
All documents are clustered (a cluster may contain only one document).
6. There is one simpleton (EngineClusteringSimpleton).
This simpleton goes around the pool searching for document entities. For
each document entity it finds, it creates an EngineDocument, wraps it
in an entity, and attaches it to the original document entity. (Note that
the EngineDocument is created by applying filtering to the document.)
Periodically it performs the clustering described above.
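The filtering that the simpleton applies when it creates an EngineDocument might look roughly like the sketch below. It is an assumption-laden illustration only: the class name, the tiny stop-word list, and the use of raw word counts as weights are all hypothetical, and the real WeightedWordList API is not described in this overview.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical filtering sketch; it does not use the Engine's filter classes. */
final class FilterSketch {

    /** Tiny illustrative stop-word list (assumption; the Engine's own list is not given here). */
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to", "is", "in");

    /** Strips HTML tags, lower-cases the text, drops stop words, and weights each word by its count. */
    static Map<String, Double> weightedWords(String content) {
        // Replace anything that looks like a tag with a space, then normalise case.
        String text = content.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Double> weights = new LinkedHashMap<>();
        // Split on runs of characters that are not ASCII letters or digits.
        for (String token : text.split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            weights.merge(token, 1.0, Double::sum);
        }
        return weights;
    }
}
```

For example, `FilterSketch.weightedWords("<p>The cat and the <b>Cat</b> sat.</p>")` yields a map in which "cat" has weight 2.0 and "sat" has weight 1.0, with the tags and stop words removed. A real filter would probably use a more robust HTML parser and a weighting scheme better suited to LSI than raw counts.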
These algorithms (especially incremental and overlapping clustering)
should be researched and refined by future engineers.
Aleksander Slominski and
Chao Mwachofi
Last modified: Mon Dec 13 14:29:53 EST 1999