
ABSTRACT

The systems and software being developed today have a high probability of being attacked, so the need for tools that prevent such attacks has increased. Many approaches to detecting vulnerabilities are already in use. In this project we present a method for detecting vulnerabilities in text-based scenarios, such as a mailing system, with improved accuracy, and for preventing vulnerable content from being sent to others. The concepts of cross-validation and ensembling are used to achieve the improved accuracy, and the dependent (target) variables are chosen so that the models are obtained in an efficient manner. The improvement is measured through the accuracy obtained on the testing dataset and is demonstrated in this phase of the project. Ensembling and prevention of vulnerable data at the source are proposed for the next phase.


TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT

              LIST OF FIGURES

1.            INTRODUCTION
              1.1  OVERVIEW OF THE PROJECT
              1.2  PROBLEM STATEMENT
              1.3  CHALLENGES AND SCOPE
              1.4  ORGANIZATION OF THE REPORT

2.            LITERATURE SURVEY
              2.1  REVIEW

3.            SYSTEM ANALYSIS
              3.1  EXISTING SYSTEM
              3.2  PROPOSED SYSTEM

4.            DESIGN AND IMPLEMENTATION
              4.1  OVERALL DESCRIPTION
              4.2  ARCHITECTURE DIAGRAM
              4.3  LIST OF MODULES
                   4.3.1  PRE-PROCESSING
                   4.3.2  CLUSTERING
                   4.3.3  TUNING
                   4.3.4  TRAINING
                   4.3.5  EVALUATION

5.            DEVELOPMENT ENVIRONMENT
              5.1  HARDWARE REQUIREMENTS
              5.2  SOFTWARE REQUIREMENTS

6.            RESULTS AND DISCUSSION
              6.1  EVALUATION METRIC
              6.2  ANALYSIS OF RESULTS

7.            CONCLUSION AND FUTURE WORK

8.            OUTPUT OF MODULES

9.            REFERENCES

LIST OF FIGURES

FIGURE    NAME OF FIGURE
4.2       ARCHITECTURE DIAGRAM
8.1       DATASETS
8.2       TEST DATA
8.3       TRAIN DATA
8.4       MODEL
8.5       TUNEGRID
8.6       RESULT

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THE PROJECT

A massive number of web applications and services have been used in financial and banking services, government, healthcare, retail and many other fields. This is because web applications and services offer important advantages, including accessibility from different locations and devices, enhanced user interaction and improved quality of the services provided to users. In most of these applications, developers focus on usability and functionality while security usually comes as an afterthought, a situation which increases the number of vulnerabilities in web applications.

As the statistics indicate, it is hard to develop fully reliable software. It is therefore important to test software components to increase the level of assurance that they are free of security vulnerabilities. However, testing resources such as testers and time are limited. Moreover, most vulnerable components stem from calls to imported functions and from the improper handling of user input, which makes vulnerability discovery more difficult. For example, in PHP, using unfiltered input from $_GET or $_POST as a query parameter might allow a malicious user to execute a SQL injection attack, while calling the echo function with unvalidated user input might expose a cross-site scripting (XSS) vulnerability. To solve this problem, many models and tools have been developed to predict vulnerabilities in a software component. Typically such methods depend on parsing the code, are limited to fixed and very small patterns, and hardly adapt to variations. Static analysis methods, which are also used for vulnerability detection, have high false positive and false negative rates in the vulnerability detection phase. A wide variety of data mining and machine learning techniques has been used to improve the ability to predict web application vulnerabilities. For instance, feature extraction and classification are used to predict whether a SQL injection vulnerability resides in the software, and machine learning methods are used to increase the ability to cover a wide range of malicious web code.

1.2 PROBLEM STATEMENT

Recent developments in text mining have reached accuracy rates of about 90%, which indicates that there is still room to improve accuracy and to reduce the false positive rate. Ordinary single-classifier models do not give exact results, and false positive rates keep increasing with the amount of data in the datasets. The genuinely accurate classifiers therefore need to be identified in order to obtain better results.

Hence this project adopts the method of ensembling, which combines the various methods and models available in data mining, and also aims to prevent vulnerable data at the source using text mining techniques.
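
As a rough illustration of the ensembling idea, and not the implementation of this phase, the sketch below combines the class predictions of three caret models by majority vote. It assumes that the dataTrain and dataTest objects defined in Chapter 4 are available and that the target column y is a factor; the particular classifiers chosen here are illustrative.

library(caret)

# Three different classifiers trained on the same training data.
m_svm <- train(y ~ ., data = dataTrain, method = "svmRadial")
m_nb  <- train(y ~ ., data = dataTrain, method = "nb")
m_rf  <- train(y ~ ., data = dataTrain, method = "rf")

votes <- data.frame(svm = predict(m_svm, dataTest),
                    nb  = predict(m_nb,  dataTest),
                    rf  = predict(m_rf,  dataTest))

# For each test row, keep the class predicted by the majority of the models.
ensemblePred <- apply(votes, 1, function(v) names(which.max(table(v))))
confusionMatrix(factor(ensemblePred, levels = levels(dataTest$y)), dataTest$y)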

 

1.3 CHALLENGES AND SCOPE

·  The accuracy achieved through this project is 94%, which can be increased further.

·  The classifiers considered can be changed to further improve efficiency.

·  The proposed project is confined to text mining, so other mining techniques, such as spatial and correlation techniques, can still be applied.

 

1.4 ORGANIZATION OF THE REPORT

The rest of the report is organized as follows. Chapter 2 presents the literature survey. Chapter 3 analyses the existing and the proposed systems, and Chapter 4 contains the detailed design and implementation. Chapter 5 lists the development environment, Chapter 6 provides a detailed analysis of the results, and Chapter 7 presents the conclusion and future work. Chapter 8 shows the outputs of the modules and Chapter 9 lists the references.

CHAPTER 2

LITERATURE SURVEY

2.1 REVIEW     

The Symantec Corporation security report statistics for 2015 showed that 78% of websites have at least one vulnerability, and that 15% of the vulnerabilities are critical ones. The statistics from the WhiteHat report for 2016 showed that the average number of vulnerabilities per site is 23, of which 13 are critical. They also showed that vulnerabilities stay open for a very long time: critical vulnerabilities have an average age of 300 days. These results indicate that web applications still contain many vulnerabilities.

A study was conducted to find out why there are so many vulnerabilities in web applications. The main reason is building stateful applications on the web's stateless infrastructure. Web servers are designed to be stateless, and HTTP, the main protocol used for communication between the server and the client, is a stateless protocol. Each HTTP request is processed independently at the server; even two related requests carry no information about each other. However, most web applications are stateful, and the server should be able to recognize the dependency between requests. Sessions and cookies have been used to solve this problem, but there are four security properties that sessions do not provide: preservation of trust state, data integrity, code integrity, and session integrity. Therefore, running a stateful application on a stateless framework that does not preserve these security properties will increase the number of vulnerabilities.

According to the Open Web Application Security Project (OWASP), the top 10 security risks for 2013 were the following: 1- Injection, 2- Broken Authentication and Session Management, 3- Cross-Site Scripting (XSS), 4- Insecure Direct Object References, 5- Security Misconfiguration, 6- Sensitive Data Exposure, 7- Missing Function Level Access Control, 8- Cross-Site Request Forgery (CSRF), 9- Using Components with Known Vulnerabilities, 10- Unvalidated Redirects and Forwards. Each of these risks differs in its exploitability, its impact, and what an attacker can do once the vulnerability has been exploited. Data mining is used in many applications, the most successful and popular being business intelligence and search engines; in recent years, however, data mining techniques have also been used in security.

Data mining techniques have brought enhancements to the security field. For example, a model has been proposed for improving the detection of malicious spam in educational institutes, using a set of data mining techniques such as feature extraction and feature selection together with different classifiers: Naïve Bayes, Support Vector Machine and Multilayer Perceptron. Usually, data mining techniques are first used to extract rules and features from the dataset; the best rules or features are then selected and used as parameters of the classifier. The classifier is trained and then used to classify new instances. This scenario is followed in the applications mentioned above and in others; one such application in the security field is vulnerability detection, which is the focus of this project.
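
To make this pipeline concrete, the sketch below runs the extract-features, select-features, train and classify steps with the caret package in R. It is only an illustration: the data frame emails, its word-frequency feature columns and its label column are hypothetical placeholders, and Naïve Bayes stands in for any of the classifiers mentioned above.

library(caret)

# Feature selection: drop near-zero-variance features before training.
predictors <- emails[, setdiff(names(emails), "label")]
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) predictors <- predictors[, -nzv]
dat <- cbind(predictors, label = emails$label)

# Train a classifier (Naive Bayes here) and classify held-out instances.
idx   <- createDataPartition(dat$label, p = 0.8, list = FALSE)
model <- train(label ~ ., data = dat[idx, ], method = "nb")
pred  <- predict(model, newdata = dat[-idx, ])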

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

The data mining technique currently in use comprises a model built from the training dataset without any cross-validation or repeats. The accuracy obtained is therefore around 92%, and the false positive rate is also high. Although all kinds of vulnerabilities are considered, the results for all of them have the same accuracy. The vulnerabilities include XSS and SQL injection.
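
For comparison, a single-pass baseline of the kind described above can be sketched in R as follows: one SVM fit with fixed, hand-picked parameters and no cross-validation or repeats. The parameter values are illustrative, and dataTrain and dataTest refer to the objects defined in Chapter 4.

library(caret); library(kernlab)

# One SVM fit with fixed parameters and no resampling.
baseline <- train(y ~ ., data = dataTrain, method = "svmRadial",
                  tuneGrid  = data.frame(sigma = 0.05, C = 1),
                  trControl = trainControl(method = "none"))

confusionMatrix(predict(baseline, dataTest), dataTest$y)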

3.2 PROPOSED SYSTEM

Efficient techniques such as cross-validation with a good number of repeats are used. Training time is also improved by assigning parallel workers based on the number of processor cores. The model applied to the training dataset is tuned over a grid, and this tuning grid determines the most efficient parameters to use whenever a dataset is passed. When the test data is passed to the resulting model, an accuracy of 94% is obtained.
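
A minimal sketch of this setup, assuming the dataTrain and svmTuneGrid objects defined in Chapter 4, is shown below; the worker count is taken from the number of available cores rather than hard-coded, and the fold and repeat counts are illustrative.

library(caret); library(doParallel)

# Register one worker per available core so resampling runs in parallel.
workers <- makeCluster(parallel::detectCores())
registerDoParallel(workers)

fit <- train(y ~ ., data = dataTrain, method = "svmRadial",
             tuneGrid  = svmTuneGrid,
             trControl = trainControl(method = "repeatedcv",
                                      number = 5, repeats = 5))
stopCluster(workers)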

 

 

CHAPTER 4

DESIGN AND IMPLEMENTATION

4.1 OVERALL DESCRIPTION

The work aims at improving the accuracy of identifying vulnerable data, thereby preventing users from accessing it and protecting the system. The dataset is pre-processed in order to clean it of unwanted information, and the cleaned data is then used to train a model. The necessary packages are installed and added to the project, the available data is split into a training set and a testing set, and the testing data is passed through the trained model to obtain the predictions and the accuracy.

 

4.2 ARCHITECTURE DIAGRAM

 

4.3 LIST OF MODULES

Pre-processing
Clustering
Tuning
Training
Evaluation

 

4.3.1 PRE-PROCESSING

The dataset is taken from publicly available open-source datasets. It comprises 4,600 observations, each of which includes 58 variables describing which words occur and their frequencies. The set of names that act as column names is available separately and is attached to the dataset to confirm the validity of the data. The dataset obtained from this pre-processing step can then be used directly for the subsequent operations.

dataset <- read.csv("data.csv", header = FALSE, sep = ";")
names   <- read.csv("names.csv", header = FALSE, sep = ";")
# Attach the separately supplied column names to the dataset.
names(dataset) <- sapply(1:nrow(names), function(i) toString(names[i, 1]))

4.3.2 CLUSTERING

The dataset needs to be split so that the train data and the test data can be obtained. To perform this split, the required packages are loaded, namely 'caret', 'kernlab' and 'doParallel'. The dataset is then partitioned with the createDataPartition() method of the caret package; the code below keeps about 80% of the observations as training data and uses the remainder as test data. To improve efficiency, four worker threads are created and made to work in parallel.

library(caret); library(kernlab); library(doParallel)
trainIndex <- createDataPartition(dataset$y, p = .8, list = FALSE, times = 1)
dataTrain  <- dataset[trainIndex, ]   # training portion of the dataset
dataTest   <- dataset[-trainIndex, ]  # held-out portion for testing
workers <- makeCluster(4, type = "SOCK")
registerDoParallel(workers)

4.3.3 TUNING

The Support Vector Machine represents all the values in the dataset spatially in order to determine which class they belong to; as each value is added to the training data, the SVM decides on which side of the separating boundary it falls. To choose suitable parameters for the training model, the sigest() function is used, which estimates a value for sigma from the 0.1 and 0.9 quantiles of ||x - x'||^2. Once the most efficient parameter is chosen, a grid is created with it. The C parameter tells the SVM optimization how much to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of classifying all the training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, misclassified examples are expected even if the training data is linearly separable. The train() function uses this grid to build an SVM for every parameter combination and keeps the one that performs best.

sigDist <- sigest(y ~ ., data = dataTrain, frac = 1)
# One sigma value from sigest() is combined with a range of C values.
svmTuneGrid <- data.frame(.sigma = sigDist[1], .C = 2^(-2:7))

4.3.4 TRAINING

The SVM is trained with the train() function of the caret package, which can be used for all the models and algorithms that caret supports. We define which data to use and which method creates the model: the svmRadial method is used and the training data is passed to train(). The trControl argument specifies repeated cross-validation with 5 repeats, ensuring that the data and class probabilities do not cause any issue for the trained result.

x <- train(y ~ ., data = dataTrain, method = "svmRadial",
           preProc = c("center", "scale"), tuneGrid = svmTuneGrid,
           trControl = trainControl(method = "repeatedcv", repeats = 5,
                                    classProbs = FALSE))

4.3.5 EVALUATION

We have now created our model x and can use it to classify emails as spam or non-spam, that is, to perform a binary classification. For the evaluation of the model we use the data frame dataTest and the predict() function of the caret package. We exclude the last column of the data frame, which contains the label indicating whether the email is spam or not. We save the predicted results in the variable pred and compare the results of our model with the actual labels in the last column of the dataTest data frame.

pred <- predict(x, dataTest[, 1:57])   # predictors only; label column excluded
acc  <- confusionMatrix(pred, dataTest$y)

CHAPTER 5

DEVELOPMENT ENVIRONMENT

5.1 HARDWARE REQUIREMENTS

HARDWARE      CONFIGURATION
RAM           1 GB and above
Processor     Dual core and above
Hard Disk     80 GB and above

Table 5.1: Hardware requirements

5.2 SOFTWARE REQUIREMENTS

SOFTWARE                   VERSION
Operating System           Windows 7
Application Environment    RStudio
Programming Language       R

Table 5.2: Software requirements

CHAPTER 6

RESULTS AND DISCUSSION

6.1 EVALUATION METRIC

ACCURACY = (TRUE POSITIVE + TRUE NEGATIVE) / TOTAL POPULATION

where TOTAL POPULATION = TRUE POSITIVE + FALSE POSITIVE + TRUE NEGATIVE + FALSE NEGATIVE

• TRUE POSITIVE – Vulnerable data correctly identified as vulnerable.
• FALSE POSITIVE – Normal data incorrectly identified as vulnerable.
• TRUE NEGATIVE – Normal data correctly identified as not vulnerable.
• FALSE NEGATIVE – Vulnerable data incorrectly identified as normal.

6.2 ANALYSIS OF RESULTS

The test data, comprising some 800 values, is passed through the trained SVM model and produces a result whose accuracy is about 4% higher than that of the previously described method. The accuracy comes to 94% for the test data. Figure 8.6 shows the obtained accuracy for the result data.

CHAPTER 7

CONCLUSION AND FUTURE WORK

The data has thus been filtered to determine which data are vulnerable and which are not, and the improved accuracy allows better filtering of the data. The future work is to implement ensembling models in order to achieve still better accuracy, and to propose a method for preventing vulnerable data at the source, thereby limiting the impact of vulnerable data during its transmission and safeguarding the entire system. Ensembling is a general term for combining many classifiers by averaging or voting. It is a form of meta-learning in that it focuses on how to merge the results of arbitrary underlying classifiers. Generally, ensembles of classifiers perform better than single classifiers, and the averaging process allows more granularity of choice in the bias-variance trade-off. Ensemble techniques include bagging, boosting, model averaging, and weak-learner theory. An obvious strategy is thus to implement as many different solvers as possible and ensemble them all together, a sort of "more models are better" approach. Text mining is the key to determining vulnerable data at the source, and efficient methods of adopting text mining will improve the mining results.

CHAPTER 8

OUTPUT OF MODULES

The outputs of the modules are shown in Figures 8.1 to 8.6: the datasets, the test data, the train data, the model, the tuning grid and the final result.

CHAPTER 9

REFERENCES

1. Symantec Corporation, "Internet Security Threat Report", vol. 21, Apr. 2016.

2. WhiteHat Security, "Web Applications Security Statistics Report", 2016.

3. J. Hawkinson and T. Bates, "Guidelines for Creation, Selection, and Registration of an Autonomous System (AS)", RFC 1930, Internet Engineering Task Force, Mar. 1996. Retrieved 2007-09-28.

4. G. McGraw, Software Security: Building Security In. Addison-Wesley, 2006.

5. J. Han, Data Mining: Concepts and Techniques. Elsevier, 2011.

6. D. Hovemeyer and W. Pugh, "Finding Bugs is Easy", ACM SIGPLAN Notices, vol. 39, no. 12, pp. 92-106, Dec. 2004.