GOJ-2321.pdf

Page 1 of 5

AN EFFICIENT MINING APPROACH FOR

HANDLING WEB ACCESS SEQUENCES

R.Nandhini

Department of Computer Science and Engineering

Sri Shakthi College of Engineering and Technology

(Autonomous)

Coimbatore, India

nanishreesha@gmail.com

Mrs.S.V.Evangelin Sonia

Department of Computer Science and Engineering

Sri Shakthi College of Engineering and Technology

(Autonomous)

Coimbatore, India

evangelinsonia@siet.ac.in

Abstract— The World Wide Web (WWW) becomes

an important source for collecting, storing, and

sharing the information. Based on the users query the

traditional web page search approximately retrieves

the related link and some of the search engines are

Alta, Vista, Google, etc. The process of web mining

defines to determine the unknown and useful

information from web data. Web mining contains the

two approaches such as data based approach and

process based approach. Now a day the data based

approach is the widely used approach. It is used to

extract the knowledge from web data in the form of

hyper link, and web log data. In this study, the

modern technique is presented for mining web access

utility based tree construction under Modified Genetic

Algorithm (MGA). MGA tree are newly created to

deploy the tree construction. In the web access

sequences tree construction for the most part relies

upon internal and external utility values. The

performance of the proposed technique provides an

efficient Web access sequences for both static and

incremental data. Furthermore, this research work is

helpful for both forward references and backward

references of web access sequences.

Keywords— Genetic Algorithm, Classification and

Regression Tree, Hyper Text Transfer Protocol,

Internet Protocol, Structured Query Language

I. INTRODUCTION

The way toward separating helpful and

interesting data from the data storehouses is as

called mining. In this modern era, information plays

a vital role. Earlier using elegant technologies like

computers, satellites, etc., enormous information are

collected and stored in mass storage devices. Vast

collection of data resulted in a mess and this leads

to the structuring and managing of data in a well- organized manner by the usage of databases.

Database Management System (DBMS) helps to

store and retrieve data from the large repositories

efficiently using queries. Web mining is the derived

concept from data mining, which extracts the

information directly from web services, web

documents, hyperlinks, web contents, and web

server logs. It mainly concentrates on the World

Wide Web (WWW) that includes its primary

source, components, and contents. The data

contents are extracted from a website that would be

the collection of web pages and it contains

structured data. It represents tables, lists, images,

audio, and video. Web mining is used to determine

the information from web data in the data mining

process. In addition, it provides a robust to a web

search engine by analyzing web content and web

document categorization. It is most useful for e- services and e-commerce applications. Figure 1.1

shows the web mining services

Figure 1.1 Web Mining Services

Also, web mining is used to understand the customer

behavior and evaluates the particular web site effectiveness

(Neelima et al. 2016). WWW contains diverse dynamic,

massive, and mainly unstructured data that provides a huge

amount of information. Web growth gives to some issues

such as determining relevant data through the internet and

observes user.

Web usage mining is the mining technique, which applied to

determine the user access patterns from web repositories.

When the user visits the web pages, automatically web

GEDRAG & ORGANISATIE REVIEW - ISSN:0921-5077

VOLUME 34 : ISSUE 01 - 2021

http://lemma-tijdschriften.com/

Page No:253

Page 2 of 5

servers record the user information such as URL, IP

Address, Hits, and weblog file. This file is the input for web

usage mining. The proposed novel hybrid approach

improves web usability with two attributes such as Hit and

Time Spent. The web server logs contain the information of

user sessions and user-oriented tasks. The user session

provides information about the user spent time inappropriate

website. Moreover, it generates the web site ranking

accurately with a clustering approach. To motivate the

successful web access sequences, we have used the web

utility mining system with utility web access. Solutions are

offered for the search challenges by the proposed hill- climbing optimization approach. In addition, the genetic

algorithm is one of the optimum processes that encompass

the extensive issues then by utilizing the local optimums, the

complex search space is also solved.

The rest of this paper is formed as follows, related animal

classification work in section 2, the proposed approach

explained in section 3, material and method described in

section 4, results discussion parts presented in section 5

conclusion in section.

II. LITRATURE REVIEW

Now a day, the internet development was incredible. The

huge measures of data from relevant data to users find it

extremely difficult. The issues can be solved by web usage

mining which includes preprocessing. Chitraa & Thanamani

(2011) designed a new technique to identify sessions in Web

usage mining (WUM). This was mainly focused on the

preprocessing approach. Unnecessary records comprised of

graphics files, robots are removed in the data-cleaning

phase. In the next phase, identification of sessions, this was

derived by forming the user behavior in a matrix format.

Matrix comprised of rows and columns in which columns

indicate the web pages and rows indicates the users and their

sessions are identified. The experimental results showed that

the session identification method was effective and accurate.

Pamutha, Chimphlee, Kimpan, & Sanguansat (2012)

discussed data preprocessing method for mining user’s

access patterns on web server log files. WUM is to convert a

log into a set of web user sessions. A web log file was

gathered from the web server and focused on the

preprocessing of the weblog file methods that can be used

for the task of session identification. The resulted study

produced statistical information on user sessions.

Maheswara Rao & Valli Kumari (2011) implemented an

extensive research framework capable of preprocessing web

log data. The learning algorithm of the proposed research

framework can isolate human user and search engine

accessed with less time. The framework reduced the error

rate and improved significant learning performance. This

framework aided to investigate web user usage behavior

effectively. The result showed that the employment of the

proposed framework of IPS provided a promising solution

in dynamic weblog development.

Pathak, Shah & Almeera (2014) presented an algorithm for

pattern discovery based on the association between the

users’ accessed web pages. This paper discussed a complete

preprocessing method to identify distinct users. The

association rule-mining algorithm is to find the frequently

accessed web pages. The biggest constraint for mining web

usage patterns are computation and memory overhead. The

experimental result showed that the algorithm was efficient

and scalable.

(Huang et al., 2015) presented an AutoODC (Auto

Orthogonal defect classification) approach to automate ODC

classification by forming it as a supervised text

classification issues. ODC is a framework used for software

defect analysis and classification, which provides a valuable

in-process feedback to system development and

maintenance. It is promising approach. This paper trained

AutoODC with the support of two machine learning

algorithm for support vector machine, Naïve Bayes and text

classification and estimated it on both industrial and larger

defect list where the industrial defect was reported from

social network domain and larger defect list was extracted

from open source system FileZilla. This approach achieved

overall accuracy of 83% (NB) and 81% (SVM) on the

industrial defect report and accuracy of 77 % (NB) and 75

% (SVM) on the larger defect list. The preprocessing

techniques are used to convert the raw data into data

abstraction based on the required users, sessions, and page

views. The recommendations and ranking techniques are

used to assign rank to the web page according to the impact

of the webpage. The tree based approaches are used to

construct the Utility based web tree in high utility web

access sequences.

III. SYSTEM DESIGN

The clustering is used to grouping the web session based on

similarity and it maximizes the intra-frame similarity

(Vellingiri et al.2015). The web session contains hyperlink

clicks. Clustering web session topics have the most popular

in various applications. In web mining, the log file defines

three steps such as data gathering, filtering, and formatting

of log entries. Various algorithms are presented for pattern

discovery named

• Clustering

• Sequential pattern analysis

• Rule mining

• Classification

However, the clustering acts to robust for determining the

web sequences. For determining the similarity between two

web sites, first, it represents the URL as a token. In this

similarity computation, we have to compare the

corresponding token at the beginning and comparison will

stop when the tokens are stopped

Figure 2 Website tree structure

GEDRAG & ORGANISATIE REVIEW - ISSN:0921-5077

VOLUME 34 : ISSUE 01 - 2021

http://lemma-tijdschriften.com/

Page No:254

Page 3 of 5

Figure 2 defines the website tree structure in the clustering

session based on the user-accessed website. The clustering

session is an important factor in web mining and analyses of

user access behavior.

The main challenge is to determine both forward

and backward web sequences. To recover this issue, the

proposed method is presented with tree construction and

MGA. This tree construction combines the two trees of

SVM tree and IGA tree. This proposed tree construction

detects the user access patterns in large database scans. The

innovative web access utility is clearly shown in this picture.

Web Log

Database Extract

Browsing Data

Compute

WASu

value of

each

sequence

Prefix Tree

Evaluate Construction

High Utility Web Access

Sequence

Figure 3 Flow of the Proposed Method

3.1 HILL CLIMBING ALGORITHM

It is one of the local search algorithms, and it

is used to solve the optimization problems in AI. It also

called a greedy approach (Bykov et al. 2016). It

continuously moves increasing direction to determine the

peak of a mountain or to determine the best solution for a

problem. After it reaches the peak value, it terminates when

no neighbor has a higher value. It mainly used for

optimizing mathematical problems. Traveling salesman

problem is the example of a hill-climbing algorithm; it

needs to reduce the distance traveled by a salesperson. This

algorithm contains two basic components such as state and

value.

 Estimate the initial or primary state, or when it is a

goal state then return success and stop.

 Loop until the solution is determined or there is no

operator left to apply.

 The operator is applied to the current state.

 Identify new state

i. It the state is goal state, it returns success

and stop.

ii. Else, if it is greater than the current state,

then allocate a new state as the current

state.

iii. Else, if it is not better than the current

state then back to step 2

 Exit

3.2 GENETIC ALGORITHM

A genetic algorithm is an optimization technique and

heuristic search that mimic the natural evolution process.

Optimization defines to determine the best set of output

values from the set of input values. In web mining, the meta

search engine searches the requests by yahoo, vista. The

individual search engine results are combined as a single

result set. Meta search engine improves the consistent

interface and coverage. N number of potential solutions for

optimization problems categorizes genetic search.

Figure 4 Genetic algorithm steps

 Initially, the GA algorithm initializes the parameters

for optimization.

 Then, determine the chromosome representation of

parameters.

 Thirdly, generate the individuals of the initial

population.

 Then, evaluate the fitness function for each

individual.

 Create a new population-based on random behavior

or selection rules.

The inspiration consequence of novel approach clarifies

the capacity of our new technique to finish the high utility

web access sequence for incremental mining.

pseudocode for proposed improved Genetic algorithm

steps

Step 1: Randomly create the initial solution

(where, i = 1, 2... n).

Step 2: Evaluate the fitness function

Fitness function  sumo f the total weight for each user

(4.14)

Each parameters of the fitness value is

estimated and shortlisted the greatest fitness

value as the best chromosome.

Step 3: To achieve the best solution, relate the

mutation and crossover

GEDRAG & ORGANISATIE REVIEW - ISSN:0921-5077

VOLUME 34 : ISSUE 01 - 2021

http://lemma-tijdschriften.com/

Page No:255

Page 4 of 5

Mutation: According to the probability, the

chromosome values are varied.

Crossover: In this process, choose one or more

parent chromosomes and after mutation, the

new solution is produced.

Step 4: Hill climbing algorithm is performed when the

new solution is infeasible.

Step 5: Current solution is enlarge

Step 6: Fitness function is discover

Step 7: When the fitness value of new function is

higher than the current solution, select

the new solution is the best one.

IV. RESULT AND DISCUSSION

The proposed method performances are evaluated from

FDR rate, tree construction time, and runtime and memory

location by adjusting the threshold value. False Detection

Rate (FDR) defines the rate of a false positive and false

negative in the null hypothesis when acquiring multiple

comparisons.

For threshold value 0.1, the SVM tree contains 0.004

FDR value and the IGA tree has 0.003 FDR value. For

threshold value 0.15, the SVM tree contains 0.0056 FDR

value and the IGA tree has 0.004 FDR value. For threshold

value 0.2, the SVM tree contains 0.009 FDR value and the

IGA tree has 0.0084 FDR value. For threshold value 0.25,

the SVM tree contains 0.019 FDR value and the IGA tree

has 0.012 FDR value. Figure 6 defines the statistical results

of FDR value for both tree SVM and IGA.

Table 1 False Detection Rate

Threshold SVM IGA

0.1 0.004 0.003

0.15 0.0056 0.004

0.2 0.009 0.0084

0.25 0.019 0.012

Table 2 describes the tree construction time for both tree

SVM and IGA. Time expended for the construction of the

tree is assessed by altering the value of the threshold. In the

SVM tree, when the value of the threshold is 0.1, time

devoured to the tree is observed to be 12s and furthermore,

the IGA tree is 18s for comparing time. At the point when

the value of the threshold is set to 0.15 then the SVM and

IGA tree construction time values are observed to be 8s and

9s. At the point when the value of the threshold is 0.2, tree

construction time 7s for SVM and IGA is 9s. At the point

when the value of the threshold is altered to 0.25, tree

construction time 7s for SVM and IGA of the relative time

values are observed to be 9s. Figure7 depicts the statistical

analysis of Tree construction time based on the threshold

value. Based on this results SVM tree has minimum

execution time compared to the IGA tree.

Table 2 Tree construction time for both tree SVM and

IGA

Threshold value

Tree construction time (sec)

SVM tree IGA tree

0.1 12 18

0.15 8 9

0.2 7 9

0.25 7 9

Table 3 defines the memory allocation for both tree SVM

and IGA. For 0.1 threshold value, the SVM tree has 278808

memory allocation times, and IGA contains 299874. For

0.15 threshold value, the SVM tree has 278896 memory

allocation times, and IGA contains 278945. For the

threshold value 0.2, the SVM tree has 2778726 memory

allocation times, and IGA contains 281451. For the 0.25

threshold value, the SVM tree has 279184 memory

allocation times, and IGA contains 277818. Figure 5.4

depicts the statistical analysis of memory allocation for both

SVM and IGA tree.

Table 3 Memory allocation for both tree SVM and IGA

Threshold value

Memory allocation (bits)

SVM tree IGA tree

0.1 278808 299874

0.15 278896 278945

0.2 277872 281451

0.25 279184 277818

GEDRAG & ORGANISATIE REVIEW - ISSN:0921-5077

VOLUME 34 : ISSUE 01 - 2021

http://lemma-tijdschriften.com/

Page No:256

Page 5 of 5

The proposed method determines both internal and external

web access sequences. The results section contains the

performance measures of HUWAS and HIUWAS FDR rate,

tree construction time, and run time and memory location by

adjusting the threshold value. The comparative analysis

compares the proposed method accuracy with various

existing methods and it proved the proposed method has the

highest accuracy.

V. CONCLUSION

In this study, the main research is web usage

mining. Web usage mining is the important factor in wide

range of applications such as business intelligence,

recommendation, web traffic, customer attraction, system

improvement and cross sales proposed the Hybrid Hill

Climbing Genetic Algorithm (HHCGA) based on tree

construction for extracting the web access sequence. For

tree construction, it designed with HUWAS tree (HHCGA

and Utility-based Web Access Sequence tree) and the

HIUWAS tree (HHCGA and Incremental Utility-based Web

Access Sequence tree). This utility based approach

determines both forward and backward references of the

web access sequences. In evaluation results, the

performance measures of HUWAS and HIUWAS FDR rate,

tree construction time, and run time and memory location

were evaluated by adjusting the threshold value. From this

performance analysis, it is observed that the proposed

technique provides an efficient Web access sequences for

both static and incremental data.

REFERENCES

1. Neelima, G., & Rodda, S. (2016, March). Predicting user

behavior through sessions using the web log mining. In

2016 International Conference on Advances in Human

Machine Interaction (HMI) (pp. 1-5). IEEE.

2. Chitraa V & Thanamani DAS 2011, ‘A novel technique

for sessions identification in web usage mining

preprocessing’, International Journal of Computer

Applications, vol. 34, no. 9.

3. Pamutha T, Chimphlee S, Kimpan C & Sanguansat P

2012, ‘Data preprocessing on web server log files for

mining users access patterns’, International Journal of

Research and Reviews in Wireless Communications

(IJRRWC), vol. 2.

4. Rao, V. M., & Kumari, V. V. (2011). An Enhanced Pre- Processing Research Framework For Web Log Data

Using A Learning Algorithm. Computer Science and

Information Technology, 10(5121), 01-15.

5. Pathak N, Shah V & Ajmeera C 2014, ‘A Memory

Efficient Algorithm with Enhance Preprocessing

Technique for Web Usage Mining’, In Proceedings of

the 2014 International Conference on Information and

Communication Technology for Competitive Strategies

ACM, p. 47.

6. Huang, L., Ng, V., Persing, I., Chen, M., Li, Z., Geng,

R., & Tian, J. (2015). AutoODC: Automated generation

of orthogonal defect classifications. Automated Software

Engineering, 22(1), 3-46.

7. Vellingiri, J., Kaliraj, S., Satheeshkumar, S., &

Parthiban, T. (2015). A novel approach for user

navigation pattern discovery and analysis for web usage

mining. Journal of Computer Science, 11(2), 372.

8. Burke, E. K., & Bykov, Y. (2017). The late acceptance

hill-climbing heuristic. European Journal of Operational

Research, 258(1), 70-78.

GEDRAG & ORGANISATIE REVIEW - ISSN:0921-5077

VOLUME 34 : ISSUE 01 - 2021

http://lemma-tijdschriften.com/

Page No:257