# data-mining-assignment-6

Please answer the questions below. They are picked from the textbook.

Chapter 2:

18.

This exercise compares and contrasts some similarity and distance measures.

(a)

For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001

y = 0100011000

(b)

Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.

(c)

Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

(d)

If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain.

Chapter5:

20.

Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, “+” and “−.” Half of the data set is used for training while the remaining half is used for testing.

(a)

Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?

(b)

Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 0.8 and negative class with probability 0.2.

(c)

Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error of a classifier that predicts every test record to be positive?

(d)

Repeat the previous analysis assuming that the classifier predicts each test record to be positive class with probability 2/3 and negative class with probability 1/3.

Chapter 6:

17.

Suppose we have market basket data consisting of 100 transactions and 20 items. If the support for item a is 25%, the support for item b is 90% and the support for itemset {a, b} is 20%. Let the support and confidence thresholds be 10% and 60%, respectively.

(a)

Compute the confidence of the association rule {a} -> {b}. Is the rule interesting according to the confidence measure?

(b)

Compute the interest measure for the association pattern {a, b}. Describe the nature of the relationship between item a and item b in terms of the interest measure.

(c)

What conclusions can you draw from the results of parts (a) and (b)?

(d) NOT NEEDED FOR THE TEST

Chapter 7:

5.

For the data set with the attributes given below, describe how you would convert it into a binary transaction data set appropriate for association analysis. Specifically, indicate for each attribute in the original data set.

(a) How many binary attributes it would correspond to in the transaction data set,

(b) How the values of the original attribute would be mapped to values of the binary attributes, and

(c) If there is any hierarchical structure in the data values of an attribute that could be useful for grouping the data into fewer binary attributes.

The following is a list of attributes for the data set along with their possible values. Assume that all attributes are collected on a per-student basis:

• Year : Freshman, Sophomore, Junior, Senior, Graduate: Masters, Graduate: PhD, Professional

• Zip code : zip code for the home address of a U.S. student, zip code for the local address of a non-U.S. student

• College : Agriculture, Architecture, Continuing Education, Education, Liberal Arts, Engineering, Natural Sciences, Business, Law, Medical, Dentistry, Pharmacy, Nursing, Veterinary Medicine

• On Campus : 1 if the student lives on campus, 0 otherwise

• Each of the following is a separate attribute that has a value of 1 if the person speaks the language and a value of 0, otherwise.

– Arabic

– Bengali

– Chinese Mandarin

– English

– Portuguese

– Russian

– Spanish

Chapter 8:

1.

Consider a data set consisting of 2^(20) data vectors, where each vector has 32 components and each component is a 4-byte value. Suppose that vector quantization is used for compression and that 2^(16) prototype vectors are used. How many bytes of storage does that data set take before and after compression and what is the compression ratio?

9.

Give an example of a data set consisting of three natural clusters, for which (almost always) K-means would likely find the correct clusters, but bisecting K-means would not.

Is this the question you were looking for? Place your Order Here 