
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 11, NOVEMBER 2010, ISSN 2151-9617, HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/, WWW.JOURNALOFCOMPUTING.ORG

Cluster Evaluation of Density-Based Subspace Clustering

Rahmat Widia Sembiring, Jasni Mohamad Zain

Abstract

– Clustering real-world data often faces the curse of dimensionality, since real-world data typically consist of many dimensions. Multidimensional data clustering can be evaluated through a density-based approach. Density approaches are based on the paradigm introduced by DBSCAN clustering: the density of each object's neighbourhood, given MinPoints, is calculated, and clusters change in accordance with changes in the density of each object's neighbourhood. The neighbours of each object are typically determined using a distance function, for example the Euclidean distance. In this paper, the SUBCLU, FIRES and INSCY methods are applied to cluster a 6x1595-dimension synthetic dataset. IO Entropy, F1 Measure, coverage, accuracy and time consumption are used as performance evaluation parameters. The evaluation results show that SUBCLU requires considerable time to process subspace clustering, but its coverage is better; meanwhile, INSCY achieves better accuracy than the other two methods, although at the cost of longer computation time.

Index Terms

— clustering, density, subspace clustering, SUBCLU, FIRES, INSCY.


1 DATA MINING AND CLUSTERING

Data mining is the process of extracting data from large databases, used as a technology to generate required information. Data mining methods can be used to predict future data trends, estimate their scope, and serve as a reliable basis in the decision-making process. The functions of data mining are association, correlation, prediction, clustering, classification, trend analysis, outlier and deviation analysis, and similarity and dissimilarity analysis. One frequently used data mining method for finding patterns or groupings in data is clustering. Clustering is the division of data into groups of objects that share similarities. Partitioning the data into smaller clusters makes the data much simpler to handle; however, important pieces of data can also be lost, so the resulting clusters need to be analyzed and evaluated.

This paper is organized into a few sections. Section 2 presents cluster analysis. Section 3 presents density-based clustering, followed by density-based subspace clustering in Section 4. Our proposed experiment, based on performance evaluation, is discussed in Section 5, followed by concluding remarks in Section 6.

2 CLUSTER ANALYSIS

Cluster analysis is a quite popular method of discretizing data [1]. Performed with multivariate statistics, cluster analysis identifies objects that share similarities and separates them from other objects, so that the variation between objects within a group is smaller than the variation between objects in different groups. Cluster analysis consists of several stages, beginning with the separation of objects into clusters or groups, followed by interpretation of the characteristic values contained within the objects of each group, and the labelling of each group. The next stage is to validate the results of the clustering, using a discriminant function.

3 DENSITY BASED CLUSTERING

Density-based clustering methods calculate the distance to the nearest neighbouring objects: each object is measured against the objects of its local neighbourhood, and if the object is relatively close to its neighbours it is said to be a normal object, and vice versa. In density-based clustering there are two notions of concern. The first is density-reachability: a point p is density-reachable from a point q with respect to Eps and MinPoints if there is a chain of points p1, ..., pn with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi, as shown in Figure-1.
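To make the definition concrete, the following sketch (plain Python; all function names such as `directly_density_reachable` are illustrative, not from any library) checks whether a point is directly density-reachable from another under the Euclidean distance. Density-reachability as defined above is then the transitive closure of this relation.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def neighbours(data, p, eps):
    """All points within distance eps of p (including p itself)."""
    return [q for q in data if euclidean(p, q) <= eps]

def is_core(data, p, eps, min_points):
    """A core point has at least min_points neighbours within eps."""
    return len(neighbours(data, p, eps)) >= min_points

def directly_density_reachable(data, p, q, eps, min_points):
    """p is directly density-reachable from q if q is a core point
    and p lies inside q's eps-neighbourhood."""
    return is_core(data, q, eps, min_points) and euclidean(p, q) <= eps

# Toy data: a dense group near the origin plus one distant outlier.
data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.4, 0.4), (5.0, 5.0)]
print(directly_density_reachable(data, (0.5, 0.0), (0.0, 0.0), 1.0, 3))  # True
print(directly_density_reachable(data, (5.0, 5.0), (0.0, 0.0), 1.0, 3))  # False
```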

————————————————

Rahmat Widia Sembiring is with the Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Kuantan, Pahang Darul Makmur, Malaysia.

Jasni Mohamad Zain is Associate Professor at the Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Kuantan, Pahang Darul Makmur, Malaysia.

© 2010 Journal of Computing Press, NY, USA, ISSN 2151-9617

http://sites.google.com/site/journalofcomputing/


Figure-1 Density Reachable

The second notion is density-connectivity: a point p is density-connected to a point q with respect to Eps and MinPoints if there is a point o such that both p and q are density-reachable from o with respect to Eps and MinPoints, as shown in Figure-2.

Figure-2 Density Connected

A density-based clustering algorithm classifies objects based on object-specific density functions. The most popular algorithm is DBSCAN [2]. In DBSCAN, data that do not lie close enough to any dense neighbourhood to form a cluster are referred to as outliers and are eliminated. DBSCAN determines the number of clusters by itself, after inclusion of the inputs Eps (the maximum radius of the neighbourhood of a point) and MinPoints (the minimum number of points that must lie within an Eps-neighbourhood), as expressed in the pseudocode in Figure-3 [Wikipedia].

DBSCAN(D, eps, MinPts)
  C = 0
  for each unvisited point P in dataset D
    mark P as visited
    N = getNeighbours(P, eps)
    if sizeof(N) < MinPts
      mark P as NOISE
    else
      C = next cluster
      expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
  add P to cluster C
  for each point P' in N
    if P' is not visited
      mark P' as visited
      N' = getNeighbours(P', eps)
      if sizeof(N') >= MinPts
        N = N joined with N'
    if P' is not yet member of any cluster
      add P' to cluster C

Figure-3 DBSCAN Algorithm
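The pseudocode of Figure-3 can be rendered as a minimal runnable sketch in Python, assuming the Euclidean distance and labelling noise as -1 (the names `dbscan`, `get_neighbours` and the seed-list expansion are illustrative, not from a specific library):

```python
import math

NOISE = -1  # label for points with no dense neighbourhood

def get_neighbours(data, i, eps):
    """Indices of all points within eps of point i (Euclidean distance)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]

def dbscan(data, eps, min_pts):
    """Plain-Python DBSCAN; returns one label per point:
    a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(data)           # None = unvisited
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue                      # already visited
        neighbours = get_neighbours(data, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE             # may later be absorbed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbours)          # expandCluster: grow via density-reachability
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(get_neighbours(data, j, eps)) >= min_pts:
                seeds.extend(get_neighbours(data, j, eps))  # j is core: keep expanding
    return labels

data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
        (5.0, 5.0), (5.1, 5.1), (5.2, 5.0),
        (9.0, 0.0)]
print(dbscan(data, eps=0.5, min_pts=3))   # → [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, while the isolated point at (9.0, 0.0) is marked as noise, matching the outlier elimination described above.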

4 DENSITY BASED SUBSPACE CLUSTERING

Subspace clustering is a method for determining the clusters that form in different subspaces; this method handles multidimensional data better than other methods. Figure-4 (Wikipedia) shows two-dimensional data with clusters placed in different subspaces. In the one-dimensional subspaces, the clusters c_a1 (in subspace {x}) and c_b, c_c, c_d (in subspace {y}) can be found; meanwhile, c_c is not included in any two-dimensional subspace cluster. In the two-dimensional space, c_ab and c_ad are identified as clusters.

Figure-4 Subspace clustering

There is considerable discussion of subspace clustering ([3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]).

The main problem in clustering is that clusters can lie in different subspaces, formed by combinations of different dimensions; the higher the number of dimensions, the more difficult it becomes to find the clusters. A subspace clustering method automatically finds the units clustered in each subspace. As in clustering in general, the key issue to analyze in subspace clustering is the density of each data object. This paper discusses the application of SUBCLU [20], FIRES [21], and INSCY [22] to subspace clustering.

SUBCLU (density-connected SUBspace CLUstering) [20] is an effective and efficient method for subspace clustering problems. Using the DBSCAN [2] notion of density-connectivity, in contrast to grid-based approaches, this method can detect the shape and position of clusters in subspaces. Exploiting the monotonicity of the density-connectivity relation, a bottom-up approach is used to prune subspaces, producing clusters that are density-connected (Figure-5) and discarding those that are not (Figure-6).

Figure-5 Cluster with density relation


Figure-6 Cluster without density relation

As long as no unnecessary subspaces are produced, SUBCLU obtains the same results as DBSCAN. The SUBCLU process can be seen in the algorithm in Figure-7.

SUBCLU(SetOfObjects DB, Real ε, Integer m)
  /* STEP 1: Generate all 1-D clusters */
  S1 := ∅   // set of 1-D subspaces containing clusters
  C1 := ∅   // set of all sets of clusters in 1-D subspaces
  FOR each ai ∈ A DO
    C{ai} := DBSCAN(DB, {ai}, ε, m)   // set of all clusters in subspace {ai}
    IF C{ai} ≠ ∅ THEN                 // at least one cluster in subspace {ai} found
      S1 := S1 ∪ {ai}
      C1 := C1 ∪ C{ai}
    END IF
  END FOR
  /* STEP 2: Generate (k+1)-D clusters from k-D clusters */
  k := 1
  WHILE Ck ≠ ∅
    /* STEP 2.1: Generate (k+1)-dimensional candidate subspaces */
    CandS(k+1) := GenerateCandidateSubspaces(Sk)
    /* STEP 2.2: Test candidates and generate (k+1)-dimensional clusters */
    FOR EACH cand ∈ CandS(k+1) DO
      // search the k-dim subspace of cand with minimal number of objects in its clusters
      bestSubspace := min over s ∈ Sk, s ⊂ cand of Σ_{Ci ∈ Cs} |Ci|
      C(cand) := ∅
      FOR EACH cluster cl ∈ C(bestSubspace) DO
        C(cand) := C(cand) ∪ DBSCAN(cl, cand, ε, m)
        IF C(cand) ≠ ∅ THEN
          S(k+1) := S(k+1) ∪ {cand}
          C(k+1) := C(k+1) ∪ C(cand)
        END IF
      END FOR
    END FOR
    k := k + 1
  END WHILE

Figure-7 SUBCLU Algorithm
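The bottom-up GenerateCandidateSubspaces step of Figure-7 works apriori-style: two k-dimensional subspaces sharing k-1 attributes are joined into a (k+1)-dimensional candidate, and the monotonicity of density-connectivity allows pruning any candidate that has a k-dimensional projection holding no cluster. A minimal sketch, assuming subspaces are represented as sorted attribute tuples (the representation and function name are illustrative):

```python
from itertools import combinations

def generate_candidate_subspaces(subspaces_k):
    """Apriori-style join: merge k-dim subspaces into (k+1)-dim candidates,
    then prune every candidate that has a k-dim projection not known to
    contain a cluster (monotonicity of density-connectivity)."""
    k = len(next(iter(subspaces_k)))
    known = set(subspaces_k)
    candidates = set()
    for s, t in combinations(sorted(subspaces_k), 2):
        merged = tuple(sorted(set(s) | set(t)))
        if len(merged) != k + 1:
            continue  # the pair does not share k-1 attributes
        # prune: every k-dim projection of the candidate must itself hold a cluster
        if all(tuple(sorted(set(merged) - {d})) in known for d in merged):
            candidates.add(merged)
    return candidates

# 1-D subspaces {x}, {y}, {z} each contain clusters: join into 2-D candidates.
print(generate_candidate_subspaces({("x",), ("y",), ("z",)}))
# 2-D case where subspace {x,z} holds no cluster, so candidate {x,y,z} is pruned.
print(generate_candidate_subspaces({("x", "y"), ("y", "z")}))
```

Each surviving candidate would then be handed to DBSCAN restricted to the best k-dimensional subspace, as in STEP 2.2 above.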

The second subspace method used in this paper is FIRES [21]. FIRES (FIlter REfinement Subspace clustering) is a framework based on an efficient filter-refinement scheme that scales at most quadratically in the number of data dimensions and the dimensionality of the subspace clusters. The method can be applied to clusters recognized by a local density threshold. FIRES consists of three phases. The first is pre-clustering, in which all so-called 1D base clusters are computed; existing subspace clustering methods can be used for this. The second phase is the generation of subspace cluster approximations: the existing base clusters are combined to find maximum-dimensional subspace cluster approximations, merged not in an apriori style but using a scheme that is at most quadratic in the number of dimensions. The final phase is post-processing of the subspace clusters, refining the clusters found in the second phase.
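The idea of the second phase, combining 1D base clusters into higher-dimensional approximations, can be illustrated with a toy sketch. This is only a simplified stand-in for the actual FIRES heuristics (which search for most-similar clusters rather than applying a naive overlap threshold); all names and the threshold are assumptions for illustration:

```python
def merge_base_clusters(base_clusters, min_overlap):
    """Illustrative merge step: pair up 1-D base clusters (sets of object
    ids, keyed by dimension) whose point overlap is at least min_overlap,
    approximating a 2-D subspace cluster."""
    merged = []
    dims = sorted(base_clusters)
    for i, d1 in enumerate(dims):
        for d2 in dims[i + 1:]:
            shared = base_clusters[d1] & base_clusters[d2]
            if len(shared) >= min_overlap:
                merged.append(((d1, d2), shared))
    return merged

base = {
    "x": {1, 2, 3, 4},   # objects clustered along dimension x
    "y": {2, 3, 4, 5},   # objects clustered along dimension y
    "z": {8, 9},         # unrelated cluster along dimension z
}
print(merge_base_clusters(base, min_overlap=3))
# → [(('x', 'y'), {2, 3, 4})]
```

Objects 2, 3 and 4 are dense along both x and y, so they form a candidate subspace cluster in {x, y}; the z base cluster shares no objects and is left alone.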

Another method used is INSCY [22] (INdexing Subspace Clusters with in-process-removal of redundancY). INSCY uses a depth-first approach, performing recursive mining in all regions of the subspace projections of a cluster before passing to the next region. This strategy has two advantages: first, maximal high-dimensional projections are processed first, so pruning can then be performed over all their lower-dimensional projections, gaining efficiency; second, potential subspace clusters can be indexed as they occur. More details can be seen in the INSCY algorithm in Figure-8.

foreach descriptor in scy-tree do
  restricted-tree := restrict(scy-tree, descriptor);
  restricted-tree := mergeWithNeighbors(restricted-tree);
  pruneRecursion(restricted-tree);      // prune sparse regions
  INSCY(restricted-tree, result);       // depth-first via recursion
  pruneRedundancy(restricted-tree);     // in-process removal
  result := DBClustering(restricted-tree) ∪ result;

Figure-8 INSCY Algorithm

5 PERFORMANCE EVALUATION

We have tested SUBCLU, FIRES and INSCY using the data sets described below, on a machine with a 1.66 GHz processor and 1 GB of RAM.

5.1 Data Sets

In our experiments, we used synthetic data consisting of 6x1595 dimensions (6 attributes and 1595 instances), adopted from [12]. Graphically, the initial data are as shown in Figure-9.

Figure-9 Dataset Distribution
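The dataset adopted from [12] is not reproduced here; the sketch below merely illustrates how a 6-attribute, 1595-instance synthetic set with clusters hidden in different subspaces could be generated. All parameters (cluster centres, spread, noise range) are assumptions for illustration:

```python
import random

def make_synthetic(n=1595, dims=6, seed=0):
    """Sketch of a 6 x 1595 synthetic set: two groups are dense in
    different attribute subspaces, the remaining points are uniform noise.
    This is an illustrative stand-in, not the dataset of [12]."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    rows = []
    for i in range(n):
        row = [rng.uniform(0, 10) for _ in range(dims)]        # background noise
        if i % 3 == 0:
            row[0], row[1] = rng.gauss(2, 0.2), rng.gauss(2, 0.2)  # cluster in {a0, a1}
        elif i % 3 == 1:
            row[3], row[4] = rng.gauss(7, 0.2), rng.gauss(7, 0.2)  # cluster in {a3, a4}
        rows.append(row)
    return rows

data = make_synthetic()
print(len(data), len(data[0]))  # → 1595 6
```

A set built this way exercises exactly the property the evaluated methods target: no single full-dimensional clustering recovers both groups, since each is dense only in its own two-attribute subspace.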
