/  6
 
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 11, NOVEMBER 2010, ISSN 2151-9617HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG 14
 
Cluster Evaluation of Density BasedSubspace Clustering
Rahmat Widia Sembiring, Jasni Mohamad Zain
 
 
Abstract
– Clustering real world data often faced with curse of dimensionality, where real world data often consist of manydimensions. Multidimensional data clustering evaluation can be done through a density-based approach. Density approachesbased on the paradigm introduced by DBSCAN clustering. In this approach, density of each object neighbours with MinPointswill be calculated. Cluster change will occur in accordance with changes in density of each object neighbours. The neighboursof each object typically determined using a distance function, for example the Euclidean distance. In this paper SUBCLU,FIRES and INSCY methods will be applied to clustering 6x1595 dimension synthetic datasets. IO Entropy, F1 Measure,coverage, accurate and time consumption used as evaluation performance parameters. Evaluation results showed SUBCLUmethod requires considerable time to process subspace clustering; however, its value coverage is better. Meanwhile INSCYmethod is better for accuracy comparing with two other methods, although consequence time calculation was longer.
 
Index Terms
— clustering, density, subspace clustering, SUBCLU, FIRES, INSCY. 
——————————
 
 
——————————
1 D
ATA
M
INING AND
C
LUSTERING
Data
 
mining
 
is
 
the
 
process
 
of
 
extracting
 
the
 
data
 
from
 
large
 
databases,
 
used
 
as
 
technology
 
to
 
generate
 
the
 
re
quired
 
information.
 
Data
 
mining
 
methods
 
can
 
be
 
used
 
to
 
predict
 
future
 
data
 
trends,
 
estimate
 
its
 
scope,
 
and
 
can
 
be
 
used
 
as
 
a
 
reliable
 
basis
 
in
 
the
 
decision
 
making
 
process.
 
Functions
 
of
 
data
 
mining
 
are
 
association,
 
correlation,
 
prediction,
 
clustering,
 
classification,
 
analysis,
 
trends,
 
out
liers
 
and
 
deviation
 
analysis,
 
and
 
similarity
 
and
 
dissimilar
ity
 
analysis.
 
One
 
of
 
frequently
 
used
 
data
 
mining
 
method
 
to
 
find
 
patterns
 
or
 
groupings
 
of
 
data
 
is
 
clustering.
 
Clustering
 
is
 
the
 
division
 
of
 
data
 
into
 
objects
 
that
 
have
 
similarities.
 
Showing
 
the
 
data
 
into
 
smaller
 
clusters
 
to
 
make
 
the
 
data
 
becomes
 
much
 
simpler,
 
however,
 
can
 
also
 
be
 
loss
 
of
 
im
portant
 
piece
 
of
 
data,
 
therefore
 
the
 
cluster
 
needs
 
to
 
be
 
analyzed
 
and
 
evaluated.
 
This
 
paper
 
organized
 
into
 
a
 
few
 
sections.
 
Section
 
2
 
will
 
present
 
cluster
 
analysis.
 
Section
 
3
 
presents
 
density
based
 
clustering,
 
followed
 
by
 
density
 
based
 
subspace
 
cluster
ing
 
in
 
Section
 
4.
 
Our
 
proposed
 
experiment
 
based
 
on
 
per
formance
 
evaluation
 
discussed
 
in
 
Section
 
5,
 
followed
 
by
 
concluding
 
remarks
 
in
 
Section
 
6.
  
2 C
LUSTER
A
NALYSIS
 
Cluster
 
analysis
 
is
 
a
 
quite
 
popular
 
method
 
of
 
discre
tizing
 
the
 
data
 
[1].
 
Cluster
 
analysis
 
performed
 
with
 
mul
tivariate
 
statistics,
 
identifies
 
objects
 
that
 
have
 
similarities
 
and
 
separate
 
from
 
the
 
other
 
object,
 
so
 
the
 
variation
 
be
tween
 
objects
 
in
 
a
 
group
 
smaller
 
than
 
the
 
variation
 
with
 
objects
 
in
 
other
 
groups.
 
Cluster
 
analysis
 
consists
 
of
 
several
 
stages,
 
beginning
 
with
 
the
 
separation
 
of
 
objects
 
into
 
a
 
cluster
 
or
 
group,
 
fol
lowed
 
by
 
appropriate
 
to
 
interpret
 
each
 
characteristic
 
value
 
contained
 
within
 
their
 
objects,
 
and
 
labelled
 
of
 
each
 
group.
 
The
 
next
 
stage
 
is
 
to
 
validate
 
the
 
results
 
of
 
the
 
clus
ter,
 
using
 
discriminant
 
function.
 
3 D
ENSITY
B
ASED
C
LUSTERING
Density
based
 
clustering
 
method
 
calculating
 
the
 
dis
tance
 
to
 
the
 
nearest
 
neighbour
 
object,
 
object
 
measured
 
with
 
the
 
objects
 
of
 
the
 
local
 
neighbourhood,
 
if
 
inter
object
 
close
 
relative
 
with
 
its
 
neighbour
 
said
 
as
 
normal
 
object,
 
and
 
vice
 
versa.
 
In
 
density
based
 
cluster,
 
there
 
are
 
two
 
points
 
to
 
be
 
concerned;
 
first
 
is
 
density
reachable,
 
where
 
p
 
point
 
is
 
density
reachable
 
from
 
point
 
q
 
with
 
Eps
,
 
MinPoints
 
if
 
there
 
are
 
rows
 
of
 
points
 
p1,
 
...,
 
pn,
 
p1
 
=
 
q,
 
pn
 
=
 
p
 
,
 
such
 
that
 
pi+1
 
directly
 
density
reachable
 
from
 
pi
 
as
 
shown
 
in
 
Fig
ure
1.
 
 
————————————————
 
 
Rahmat Widia Sembiring, is with Faculty of Computer Systems andSoftware Engineering Universiti Malaysia Pahang, Lebuhraya TunRazak, 26300, Kuantan, Pahang Darul Makmur, Malaysia
 
Jasni Mohamad Zain, is Associate Professor at Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang, Le-buhraya Tun Razak, 26300, Kuantan, Pahang Darul Makmur, Malay-sia
© 2010 Journal of Computing Press, NY, USA, ISSN 2151-9617
http://sites.google.com/site/journalofcomputing/
 
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 11, NOVEMBER 2010, ISSN 2151-9617HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG 15
 
  
Figure
1
 
Density
 
Reachable
 
 
The
 
second
 
point
 
is
 
density
connected,
 
where
 
point
 
p
 
as
 
density
connected
 
at
 
the
 
point
 
q
 
with
 
Eps
,
 
MinPoints
 
if
 
there
 
is
 
a
 
point
 
o
 
such
 
that
 
p
 
and
 
q
 
density
reachable
 
from
 
o
 
with
 
Eps
 
and
 
MinPoints
,
 
as
 
shown
 
in
 
Figure
2.
 
 
 
  
Figure
2
 
Density
 
Connected
 
 
Density
based
 
clustering
 
algorithm
 
will
 
classify
 
ob
jects
 
based
 
on
 
object
specific
 
functions.
 
The
 
most
 
popular
 
algorithm
 
is
 
DBSCAN
 
[2].
 
In
 
DBSCAN,
 
the
 
data
 
do
 
not
 
have
 
enough
 
distance
 
to
 
form
 
a
 
cluster,
 
referred
 
as
 
an
 
outlier,
 
will
 
be
 
eliminated.
 
DBSCAN
 
will
 
determine
 
for
 
themselves
 
the
 
number
 
of
 
clusters
 
generated
 
after
 
inclu
sion
 
of
 
input
 
Eps
 
(maximum
 
radius
 
of
 
neighbourhood
 
of
 
a
 
point)
 
and
 
MinPoints
 
(minimum
 
number
 
of
 
points
 
which
 
is
 
in
 
an
 
environment
 
Eps
),
 
expressed
 
in
 
pseudocode
 
algo
rithms
 
such
 
as
 
Figure
3
 
[Wikipedia].
 
 
DBSCAN(D, eps, MinPts)C = 0 for each unvisited point P in dataset Dmark P as visited N = getNeighbours (P, eps)if sizeof(N) < MinPtsmark P as NOISE elseC = next cluster expandCluster(P, N, C, eps, MinPts) expandCluster(P, N, C, eps, MinPts)add P to cluster C for each point P' in N if P' is not visited mark P' as visited N' = getNeighbours(P', eps)if sizeof(N') >= MinPtsN = N joined with N' if P' is not yet member of any cluster add P' to cluster C 
 
 
Figure-3 DBSCAN Algorithm
4 D
ENSITY
B
ASED
S
UBSPACE
C
LUSTERING
 
Subspace
 
clustering
 
is
 
a
 
method
 
to
 
determine
 
the
 
clusters
 
that
 
form
 
on
 
a
 
different
 
subspaces,
 
this
 
method
 
is
 
better
 
in
 
handling
 
multidimensional
 
data
 
than
 
other
 
methods.
 
Figure
4
 
(wikipedia)
 
shows
 
the
 
two
 
dimensions
 
of
 
the
 
clusters
 
placed
 
in
 
a
 
different
 
subspace.
 
On
 
the
 
dimen
sion
 
of
 
the
 
subspace
 
cluster
 
c
a1
 
(in
 
the
 
subspace
 
{x}
)
 
and
 
c
b
,
 
c
c
,
 
c
d
 
(in
 
the
 
subspace
 
{y}
)
 
can
 
found.
 
Meanwhile
 
c
c
 
not
 
included
 
in
 
the
 
subspace
 
cluster.
 
In
 
two
dimensional
 
clus
ter,
 
c
ab
 
and
 
c
ad
 
identified
 
as
 
clusters.
 
  
 
Figure
4
 
Subspace
 
clustering
 
There
 
is
 
some
 
discussion
 
of
 
subspace
 
clustering
 
([3],
 
[4],
 
[5],
 
[6],
 
[6],
 
[8],
 
[9],
 
[10],
 
[11],
 
[12],
 
[13],
 
[14],
 
[15],
 
[16],
 
[17],
 
[18],
 
[19]).
 
The
 
main
 
problem
 
in
 
clustering
 
is
 
the
 
clus
ter
 
can
 
be
 
in
 
a
 
different
 
subspace,
 
with
 
the
 
combination
 
of
 
different
 
dimensions,
 
if
 
the
 
number
 
of
 
dimensions
 
higher,
 
caused
 
more
 
difficult
 
to
 
find
 
clusters.
 
Subspace
 
clustering
 
method
 
will
 
automatically
 
find
 
the
 
units
 
clustered
 
in
 
each
 
subspace.
 
As
 
clustering
 
in
 
general,
 
important
 
to
 
analyze
 
in
 
subspace
 
clustering
 
is
 
the
 
problem
 
of
 
density
 
of
 
each
 
data
 
object.
 
In
 
this
 
paper
 
will
 
discuss
 
the
 
application
 
of
 
SUBCLU
 
[20],
 
FIRES
 
[21],
 
and
 
INSCY
 
[22]
 
for
 
subspace
 
clustering.
 
SUBCLU
 
(density
connected
 
SUBspace
 
CLUstering
)
 
[20],
 
is
 
an
 
effective
 
and
 
efficient
 
method
 
in
 
subspace
 
clus
tering
 
problems.
 
Using
 
the
 
concept
 
of
 
density
 
in
 
relation
 
DBSCAN
 
[2],
 
with
 
grid
based
 
approach,
 
this
 
method
 
can
 
detect
 
the
 
shape
 
and
 
position
 
of
 
clusters
 
in
 
the
 
subspace.
 
Monotonous
 
nature
 
of
 
the
 
relationship
 
density,
 
bottom
up
 
approach
 
used
 
to
 
pruning
 
subspace
 
and
 
produces
 
clusters
 
that
 
are
 
connected
 
with
 
density
 
(Figure
5)
 
and
 
which
 
are
 
not
 
connected
 
with
 
the
 
density
 
(Figure
6).
 
 Figure-5 Cluster with density relation
 
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 11, NOVEMBER 2010, ISSN 2151-9617HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/ WWW.JOURNALOFCOMPUTING.ORG  16
 
 
 
Figure
6
 
Cluster
 
without
 
density
 
relation
  
As
 
long
 
as
 
no
 
unnecessary
 
subspace
 
produced,
 
the
 
result
 
will
 
be
 
the
 
same
 
SUBCLU
 
obtained
 
by
 
DBSCAN,
 
SUBCLU
 
processes
 
can
 
be
 
seen
 
through
 
the
 
algorithm
 
in
 
Figure
7.
 
SUBCLU(SetOfObjects DB, Real ", Integer m)/* STEP 1 Generate all 1-D clusters */ S1 := ; // set of 1-D subspaces containing clustersC1 := ; // set of all sets of clusters in 1-D subspacesFOR each ai 2 A DOCfaig := DBSCAN(DB; faig; ";m) // set of all clusters in subspace ai;IF Cfaig 6= ; THEN // at least one cluster in subspace faig found S1 := S1 [ faig;C1 := C1 [ Cfaig;END IF END FOR/* STEP 2 Generate (k + 1)-D clusters from k-D clusters */ k := 1;WHILE Ck 6= ;/* STEP 2.1 Generate (k + 1)-dimensional candidate subspaces */ CandSk+1 := GenerateCandidateSubspaces(Sk);/* STEP 2.2 Test candidates and generate (k + 1)-dimensional clusters */ FOR EACH cand 2 CandSk+1 DO// Search k-dim subspace of cand with minimal number of objects in the clustersbestSubspace := mins2Sk^s_cand Ci2CsjCij Ccand := ;;FOR EACH cluster cl 2 CbestSubspace DOCcand = Ccand [ DBSCAN(cl; cand; ";m);IF Ccand 6= ; THEN Sk+1 := Sk+1 [ cand;Ck+1 := Ck+1 [ Ccand;END IF END FOREND FORk := k + 1END WHILE 
 
Figure
7
 
SUBCLU 
 
Algorithm
 
 
Second
 
subspace
 
method
 
used
 
in
 
this
 
paper
 
is
 
FIRES
 
[21].
 
FIRES
 
(
FIlter
 
REfinement
 
Subspace
 
clustering
)
 
frame
work
 
based
 
on
 
efficiency
 
filter
 
refinement,
 
by
 
determin
ing
 
the
 
frequency
 
scale
 
quadratic
 
of
 
data
 
dimensions
 
and
 
dimensional
 
subspace
 
clusters.
 
This
 
method
 
can
 
applied
 
to
 
clusters
 
that
 
recognized
 
based
 
on
 
the
 
local
 
threshold
 
density.
 
FIRES
 
consist
 
of
 
three
 
phases,
 
namely
 
pre
clustering,
 
in
 
this
 
phase
 
all
 
the
 
so
called
 
cluster
 
1D
base
 
clusters
 
will
 
be
 
calculated,
 
which
 
can
 
use
 
existing
 
methods
 
of
 
sub
space
 
clustering.
  
The
 
second
 
phase
 
is
 
the
 
generation
 
of
 
subspace
 
cluster
 
approximations,
 
in
 
this
 
phase
 
the
 
exist
ing
 
clusters
 
will
 
combined
 
to
 
find
 
the
 
maximum
 
dimen
sional
 
subspace
 
cluster
 
approach,
 
but
 
not
 
incorporate
 
in
 
apriori
 
style,
 
but
 
using
 
the
 
scale
 
most
 
quadratic
 
of
 
the
 
number
 
of
 
dimensions.
 
The
 
final
 
stage
 
is
 
to
 
post
processing
 
subspace
 
clusters,
 
by
 
smoothing
 
the
 
cluster
 
on
 
the
 
second
 
phase.
 
Another
 
method
 
that
 
was
 
used
 
INSCY
 
[22]
 
(
INdexing
 
Subspace
 
Clusters
 
with
 
in
process
removal
 
of 
 
redundancY 
),
 
used
 
breadth
 
first
 
approach,
 
by
 
performing
 
recursive
 
mining
 
in
 
all
 
parts
 
of
 
the
 
cluster
 
of
 
subspace
 
projections,
 
before
 
passing
 
to
 
the
 
next
 
section.
 
This
 
strategy
 
has
 
two
 
advantages,
 
high
 
dimensional
 
maximal
 
projection
 
will
 
done
 
first,
 
then
 
perform
 
pruning
 
of
 
all
 
loop
 
dimensions
 
and
 
gain
 
efficiency,
 
second,
 
index
ing
 
potential
 
of
 
subspace
 
clusters
 
that
 
may
 
occur.
 
For
 
more
 
details
 
can
 
see
 
in
 
the
 
algorithm
 
INSCY
 
in
 
Figure
8.
 
 
foreach descriptor in scy-tree dorestricted-tree := restrict(scy-tree, descriptor);restricted-tree := mergeWithNeighbors(restricted-tree);pruneRecursion(restricted-tree); //prune sparse regionsINSCY(restricted-tree,result); //depth-first via recursionpruneRedundancy(restricted-tree); //in-process-removal result := DBClustering(restricted-tree) result;
 
Figure-8 INSCY Algorithm
 
5 P
ERFORMANCE
E
VALUATION
 
We
 
have
 
tested
 
SUBCLU,
 
FIRES
 
and
 
INSCY
 
using
 
data
 
sets,
 
with
 
processor
 
1.66
 
GHz
 
and
 
1
 
GB
 
of
 
RAM.
 
 
5.1
 
Data
 
Sets
 
In
 
our
 
experiments,
 
synthetic
 
data
 
was
 
used
 
consist
ing
 
of
 
6x1595
 
dimensions
 
(6
 
attributes
 
and
 
1595
 
instant
 
/
 
data)
 
that
 
our
 
adoption
 
of
 
[12],
 
graphically
 
the
 
initial
 
data
 
are
 
as
 
in
 
Figure
9.
 
 
Figure
9
 
Dataset
 
Distribution
 

Share & Embed

More from this user

Add a Comment

Characters: ...