I. Data and Methodology
The colleges included in this study were selected based on the U.S. News &
World Report ranking of the top 200 national colleges (15). The ranking uses
quantitative and qualitative measures to select the best schools.
The education data were obtained from the Integrated Postsecondary Education
Data System (IPEDS), a set of surveys conducted annually by the U.S.
Department of Education’s National Center for Education Statistics (16). For
the population factors, the 2016 Census data (17) were used, and attributes
relevant to the study were selected.
A. Data Collection and Attribute Selection
Attributes that were relevant to the study were collated in a “csv” file. A
total of 9 datasets and 72 attributes were initially collected from IPEDS for
the top 200 national colleges. The college attributes selected were the name of
the college, location, graduation rate, adjusted cohort, average ACT percentile
of admitted students, average SAT percentile of admitted students, number of
completers, on- and off-campus expenses, and in-state and out-of-state tuition
fees. From the 2016 Census data, the attributes selected were household income,
population, educational attainment, median age, and room occupancy.
The data mining
tool used for the analysis in this study was WEKA. It is open-source
software that supports many data loading, data transformation, data modeling,
and data visualization methods and is widely used in data mining applications.
It also has an easy step-by-step tutorial and an intuitive GUI for designing
and running data mining processes. The merged “csv” file containing the
relevant attributes was loaded into the WEKA Explorer.
C. Data Preprocessing
Before applying any data mining techniques, the datasets were
preprocessed by merging the 9 datasets from IPEDS into one file based on
the college name and location, to make the data easier to handle.
Relevant fields were selected for each college from the IPEDS dataset. Census
data were merged by city, i.e. if the state name in the college information
table is the same as that in the Census information table, the relevant
attributes are merged. College data and Census data were then merged into a
single “csv” file. The final file contained a total of 207 universities and 9
attributes. Data instances with incomplete attribute values were removed.
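The merge and clean-up steps above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure actually used in the study: the field names (`name`, `state`, `grad_rate`, `out_state_tuition`, `median_age`) and the sample rows are hypothetical, and the Census join is keyed on a single location field called `state`, following the example given in the text.

```python
def merge_on_keys(left, right, keys):
    """Inner-join two lists of dict rows on the given key fields."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged.append({**row, **match})
    return merged


def drop_incomplete(rows):
    """Remove rows in which any attribute value is missing (None or empty)."""
    return [r for r in rows if all(v not in (None, "") for v in r.values())]


# Hypothetical sample rows standing in for two IPEDS files and the Census table.
ipeds_rates = [
    {"name": "College A", "state": "NY", "grad_rate": 91.0},
    {"name": "College B", "state": "TX", "grad_rate": None},
]
ipeds_tuition = [
    {"name": "College A", "state": "NY", "out_state_tuition": 52000},
    {"name": "College B", "state": "TX", "out_state_tuition": 38000},
]
census = [
    {"state": "NY", "median_age": 38.2},
    {"state": "TX", "median_age": 34.2},
]

# Merge the IPEDS files on college name and location, then attach Census data.
colleges = merge_on_keys(ipeds_rates, ipeds_tuition, ["name", "state"])
combined = merge_on_keys(colleges, census, ["state"])
# Drop instances with incomplete attribute values.
clean = drop_incomplete(combined)
```

In this toy input, the second college has no graduation rate and is therefore dropped in the last step.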
The graduation rate attribute was used as the performance
metric and used to classify colleges into high and low graduation rates.
Because the graduation rates varied widely, they were normalized by
dividing them into 3 intervals, i.e. 0-33.33, 33.33-66.66, and 66.66-100,
labeled as low, medium, and high graduation rate. Certain attributes were
combined into one to make the dataset easier to handle. For example, on-campus
accommodation expenses and on-campus food expenses were combined in the table
as total on-campus expenses. Supervised discretization was applied to
attributes with many distinct values to normalize them.
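The binning of graduation rates and the combining of expense attributes can be illustrated with a short Python sketch. The function and field names are hypothetical, and the handling of the boundary values 33.33 and 66.66 (assigned here to the higher interval) is an assumption, since the text does not state it. The supervised discretization step is not reproduced here.

```python
def label_graduation_rate(rate):
    """Map a graduation rate in percent (0-100) to low / medium / high."""
    if not 0 <= rate <= 100:
        raise ValueError(f"graduation rate out of range: {rate}")
    if rate < 33.33:
        return "low"
    if rate < 66.66:  # assumption: boundary values fall in the higher interval
        return "medium"
    return "high"


def total_on_campus_expenses(row):
    """Combine on-campus accommodation and food expenses into one attribute."""
    return row["on_campus_accommodation"] + row["on_campus_food"]


# Hypothetical row; the field names and values are illustrative only.
row = {"grad_rate": 72.5, "on_campus_accommodation": 9000, "on_campus_food": 4500}
row["grad_label"] = label_graduation_rate(row["grad_rate"])
row["on_campus_total"] = total_on_campus_expenses(row)
```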
A correlation matrix was computed to determine which attributes were
correlated with the graduation rate. The top 10 attributes most strongly
correlated with the graduation rate were selected as the final attributes
to be used for prediction. An attribute selection evaluator was used to
rank the attributes against the class label, i.e. graduation rate. The SAT and
ACT percentiles of the college ranked at the top of the list with scores of
0.3247 and 0.2377, respectively, followed by out-of-state tuition fee (0.1339),
number of completers (0.099), and in-state tuition fee (0.0882). The
preprocessed data set was then used for the prediction.
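As a rough illustration of ranking attributes against the graduation rate, the sketch below scores each attribute by its absolute Pearson correlation with the target and keeps the top k. This is an assumption made for illustration: the text does not name the WEKA evaluator that produced the scores above, so those values would not be expected to match this measure. The column names and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, target, top_k=10):
    """Rank attributes by absolute correlation with the target column."""
    scores = {
        name: abs(pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical toy data; these are not values from the study.
columns = {
    "grad_rate": [60.0, 70.0, 80.0, 90.0],
    "sat_percentile": [50.0, 60.0, 72.0, 85.0],
    "median_age": [36.0, 36.0, 36.0, 36.0],
    "population": [1.0, 3.0, 2.0, 4.0],
}
ranking = rank_attributes(columns, "grad_rate")
```

Here `ranking` is a list of `(attribute, score)` pairs in descending order of score, so the first entries are the candidates to keep as final attributes.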