

Q.1. How will you choose the right algorithm? (with justification)


Choosing the right algorithm mainly depends on what type of problem is to be solved; for that, the problem statement carries a major role.


For a problem statement to be created, the input and output of the problem to be solved need to be present. Also, the problem statement should not be jumbled up; it should be up to the mark.

Once the problem statement is identified, one of the below-mentioned algorithms can be chosen:

Classification algorithm

Clustering algorithm

Regression algorithm


When to use a classification algorithm

Use classification when your problem statement is to divide or partition certain things into certain classes,

e.g. classifying whether a person is fat or thin.

Classification is a supervised learning technique.

Examples of classification algorithms:

K-Nearest Neighbour

Decision Trees

Bayesian Classifier
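As a minimal illustration of classification, here is a toy k-nearest-neighbour classifier written from scratch; the (height, weight) points and the fat/thin labels are invented for the example, not taken from any real data set.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    top_k = [y for _, y in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy data: (height_cm, weight_kg) labelled "fat" / "thin"
train = [(150, 45), (160, 50), (170, 95), (180, 100)]
labels = ["thin", "thin", "fat", "fat"]
print(knn_predict(train, labels, (175, 90), k=3))  # -> fat
```

Because this is supervised learning, the labels must be provided up front; the algorithm only generalizes from them.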


When to use a clustering algorithm

When a large data set is to be divided into clusters, i.e. into groups, clustering algorithms are used.

Clustering is an unsupervised learning technique.

Examples of clustering algorithms:


K-Means Algorithm

Expectation maximization
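A minimal sketch of the K-Means idea on made-up one-dimensional data; a real implementation would use random restarts and a convergence check rather than a fixed iteration count and deterministic initialization.

```python
def kmeans(points, k=2, iters=20):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    # Deterministic init: pick k evenly spaced points from the sorted data.
    pts = sorted(points)
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)] if k > 1 else [pts[0]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 1 and around 10 - no labels are needed.
print(kmeans([1.0, 1.2, 0.8, 9.9, 10.1, 10.0], k=2))
```

Note that, unlike classification, no labels are given: the groups emerge from the data alone, which is what makes this unsupervised.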


When to use a regression algorithm

When the data set provided is numerical and the output to be predicted is also a numerical value, regression can be applied.

Regression is a supervised learning technique.

Examples of regression algorithms:

Linear Regression Algorithm

Logistic Regression Algorithm (despite its name, logistic regression is usually used for classification)

Polynomial Regression Algorithm etc.
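A minimal sketch of linear regression, fitting a straight line y = a*x + b by ordinary least squares; the data points are invented so that the line y = 2x + 1 fits them exactly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D data:
    slope = cov(x, y) / var(x), intercept from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# The points lie exactly on y = 2x + 1, so the fit recovers a=2, b=1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # -> 2.0 1.0
```

Both the inputs and the predicted output are numerical values, which is exactly the situation described above.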





Q.2. Write in detail about the different steps that you will perform to
develop a machine learning application for the given scenarios

1. Spam detection

2. Recommendations

3. Stock market prediction

4. Automated


For Spam Detection

Since in spam detection we need to classify whether an email is spam or not spam, a classification algorithm is to be applied.

The spam detection pipeline will involve five stages:

Loading the data,

Preprocessing the text,

Extracting the features,

Training the classifier, and

Evaluating the classifier.


Stage 1: Loading the data

A machine learning system works in two modes: training and testing.

Training: During training, the machine learning system is given labeled data from a training data set. For example, the labeled training data is a large set of emails that are labeled spam or not spam (ham). During the training process, the classifier (the part of the machine learning system that ultimately predicts the labels of future emails) learns from the training data by determining the connections between the features of an email and its label.

Testing: During testing, the machine learning system is given unlabelled data. For example, this data consists of emails without the spam/ham label. Depending on the features of an email, the classifier predicts whether the email is spam or ham. This classification is compared to the true value of spam/ham to measure accuracy.

Stage 2: Preprocessing

Before feeding the emails to our classifiers, we have to pre-process them. The objective is to create a feature matrix with rows being the emails and columns being the features. After removing HTML tags and extracting the relevant text, additional pre-processing must be done to create the feature matrix.

After preliminary pre-processing (removing HTML tags and headers from the emails in the data set), we perform the following steps:

Tokenize – We make "tokens" from each word in the email by removing punctuation.

Remove trivial words – Stop-words should be removed. Stop-words don't give meaningful information to the classifier, and they increase the dimensionality of the feature matrix. In addition to the many stop-words, we also remove words of more than 12 characters and words of under three characters.

Stem – Each token is converted to its "stem". Similar words are converted to the same stem in order to form a better feature matrix. This enables words with similar meanings to be treated the same. For instance, history, histories and historic will be viewed as the same word in the feature matrix. Each stem is placed into our "bag of words", which is simply a list of every stem used in the dataset.

Create feature matrix – After making the "bag of words" from all of the stems, we create a feature matrix. The feature matrix is constructed such that the entry in row i and column j is the number of times that token j occurs in email i.
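The four steps above can be sketched as follows. The stop-word list and the suffix-chopping stemmer are simplified stand-ins for what a real system would use (e.g. a full stop-word list and a Porter stemmer), and the two example emails are made up:

```python
import re

STOP_WORDS = {"the", "is", "of", "and", "a", "to"}  # illustrative subset

def tokenize(text):
    """Lowercase, strip punctuation, drop stop-words and very short/long words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and 3 <= len(w) <= 12]

def stem(word):
    """Toy stemmer: chop common suffixes (a real system would use Porter stemming)."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def feature_matrix(emails):
    """Rows = emails, columns = stems; entry (i, j) counts stem j in email i."""
    docs = [[stem(w) for w in tokenize(e)] for e in emails]
    vocab = sorted({s for doc in docs for s in doc})  # the "bag of words"
    matrix = [[doc.count(s) for s in vocab] for doc in docs]
    return vocab, matrix

vocab, X = feature_matrix(["Win money now", "the meeting is moved to monday"])
print(vocab)  # each column of X corresponds to one stem in this list
print(X)
```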



Stage 3: Extracting the features

Once the text is pre-processed, you can extract the features characterising spam and ham emails. The first thing to notice is that some words, for example "the", "is" or "of", appear in all emails and don't have much content to them. These words are not going to help you distinguish spam from ham. Such words are called stop words and they can be disregarded during classification.

To extract the features – words that can tell the program whether the email is spam or ham – you'll have to do the following:

i.     Read in the text of the email.

ii.     Pre-process it using the pre-process function defined above.

iii.     For each word that isn't in the stop word list, either

iv.     calculate how frequently it occurs in the text, or

v.     record the fact that the word occurs in the email.

The former approach is known as bag-of-words (BoW), and it enables the classifier to see that certain keywords may occur in both kinds of emails but with different frequencies.
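A small sketch contrasting the two options in steps iv and v, i.e. counting word frequencies (bag-of-words) versus only recording presence; the stop-word list and the example message are made up:

```python
from collections import Counter

STOP_WORDS = {"the", "is", "of", "a", "to"}  # illustrative subset

def count_features(text):
    """Bag-of-words: how often each non-stop word occurs (step iv)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return dict(Counter(words))

def binary_features(text):
    """Presence features: only whether each word occurs at all (step v)."""
    return {w: 1 for w in count_features(text)}

msg = "free money free prizes claim your money"
print(count_features(msg))   # repeated words keep their counts
print(binary_features(msg))  # repeated words collapse to 1
```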


Stage 4: Training a classifier

Now that the data is in the right format, you can split it into a training set that will be used to train the classifier, and a test set that will be used to evaluate it. Typically, the data is split using 80% for training and the other 20% for testing.
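A minimal sketch of the 80/20 split on toy data, shuffling first so the split is not biased by the order of the data set:

```python
import random

def train_test_split(data, labels, test_frac=0.2, seed=42):
    """Shuffle the indices, then hold out the last `test_frac` of examples."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([data[i] for i in train_idx], [labels[i] for i in train_idx],
            [data[i] for i in test_idx], [labels[i] for i in test_idx])

emails = [f"email {i}" for i in range(10)]
y = ["spam" if i % 2 else "ham" for i in range(10)]
X_tr, y_tr, X_te, y_te = train_test_split(emails, y)
print(len(X_tr), len(X_te))  # -> 8 2
```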

Stage 5: Evaluating your classifier's performance

Check whether your classifier is doing a good job at distinguishing spam or not. This is the last step and determines the performance of your classifier.
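A sketch of the evaluation step, computing accuracy together with precision and recall for the spam class; the predicted and actual labels below are made up for illustration:

```python
def evaluate(predicted, actual):
    """Accuracy, plus precision and recall for the 'spam' class."""
    tp = sum(p == a == "spam" for p, a in zip(predicted, actual))  # true positives
    fp = sum(p == "spam" and a == "ham" for p, a in zip(predicted, actual))
    fn = sum(p == "ham" and a == "spam" for p, a in zip(predicted, actual))
    correct = sum(p == a for p, a in zip(predicted, actual))
    return {
        "accuracy": correct / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

pred = ["spam", "spam", "ham", "ham"]
true = ["spam", "ham", "ham", "spam"]
print(evaluate(pred, true))
```

Precision and recall matter here because the classes are imbalanced in practice: a classifier that labels everything "ham" can still have high accuracy while catching no spam at all.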


For Recommendations

Define problem statements

Load datasets

Generate a popularity model

Proceed with collaborative filtering model

Evaluate engine


1- Define the problem statement based on the type of recommendation engine needed to be built.

2- There are multiple datasets on the web which you can use during the evaluation step. Depending on your model and the auxiliary information used (tags, timestamps, ratings, etc.) you should choose the dataset closest to your use case.

3- Generate a popularity based model, i.e. the one where all the users get the same recommendations based on the most popular choices.

4- Proceed with a collaborative filtering model. The core idea works in 2 steps:

Find similar items by using a similarity metric.

For a user, recommend the items most similar to the items he or she already likes.

To give a high-level overview, this is done by building an item-item matrix in which we keep a record of the pairs of items which were rated together. In this case, an item is a movie. Once we have the matrix, we use it to determine the best recommendations for a user based on the movies he has already rated. Note that there are a few more things to take care of in an actual implementation, which would require deeper mathematical introspection, which I'll skip for now.

5- Check whether your model works properly and correctly.
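A toy sketch of the item-item idea in step 4, using simple co-occurrence counts as the similarity metric; the movie ratings below are invented for the example, and a real engine would use a proper similarity measure such as cosine similarity over rating vectors.

```python
from collections import defaultdict
from itertools import combinations

def item_item_matrix(user_ratings):
    """Count how often each pair of items was rated together (co-occurrence)."""
    co = defaultdict(int)
    for items in user_ratings.values():
        for a, b in combinations(sorted(items), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def recommend(user, user_ratings, co, n=1):
    """Recommend unseen items that co-occur most with the user's rated items."""
    seen = user_ratings[user]
    scores = defaultdict(int)
    for liked in seen:
        for (a, b), cnt in co.items():
            if a == liked and b not in seen:
                scores[b] += cnt
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical ratings: user -> set of movies rated
ratings = {
    "alice": {"Matrix", "Inception"},
    "bob":   {"Matrix", "Inception", "Titanic"},
    "carol": {"Inception", "Titanic"},
    "dave":  {"Matrix"},
}
co = item_item_matrix(ratings)
print(recommend("dave", ratings, co))  # -> ['Inception']
```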



3. Stock market prediction

Define problem statement

Train / Test split

Since we always want to predict the future, we take the latest 10% of the data as the test set.




1- Define the problem statement based on the type of engine needed to be built.


Step 2- The stock prices form a time series of length N, defined as p_0, p_1, ..., p_(N-1), in which p_i is the close price on day i. Imagine that we have a sliding window of a fixed size (later, we refer to this as input_size) and every time we move the window to the right by that size, so that there is no overlap between data in all the sliding windows.


Step 3- The S&P 500 index increases over time, bringing about the problem that most values in the test set are out of the scale of the train set, and thus the model has to predict some numbers it has never seen before. Sadly and unsurprisingly, it does a tragic job. To solve the out-of-scale issue, I normalize the prices in each sliding window. The task becomes predicting the relative change rates instead of the absolute values. In a normalized sliding window at time t, all the values are divided by the last unknown price, i.e. the last price in the previous window.
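A sketch of the sliding-window normalization described above, on made-up prices and with a hypothetical input_size of 2:

```python
def sliding_windows(prices, input_size):
    """Split the series into consecutive non-overlapping windows."""
    return [prices[i:i + input_size]
            for i in range(0, len(prices) - input_size + 1, input_size)]

def normalize(windows):
    """Divide each window by the last price of the previous window, so the
    model predicts relative change rates instead of absolute values."""
    out = []
    for prev, cur in zip(windows, windows[1:]):
        base = prev[-1]
        out.append([p / base for p in cur])
    return out

prices = [100.0, 102.0, 101.0, 103.0, 106.0, 104.0, 108.0, 110.0]
wins = sliding_windows(prices, input_size=2)
print(normalize(wins))  # values hover around 1.0, whatever the price level
```

Note the first window is consumed as the base for the second, so the normalized series is one window shorter than the original.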


Step 4- The training requires max_epoch epochs in total; an epoch is a single full pass of all the training data points. In one epoch, the training data points are split into mini-batches of a fixed size.

4. Automated

Define problem statement

Load datasets

Generate a general model

Proceed with a deeper model

Evaluate engine


Step 1- The problem statement should be according to the application needed to be built.

Step 2- Load the training data of the application you want to build. Every application consists of varied data.

Step 3- Generate a general model where common data sets need to be compared and built.

Step 4- Once the general model is built, proceed with a deeper model.

Step 5- After completion of the engine, evaluate, i.e. analyse, your engine to see whether it is up to the mark or not.