Question 1

Suppose we have a simplified language with:

  • Three vowels
  • Three consonants

Let X denote the vowel and Y the consonant within a syllable, each mapped to a range of three values. Suppose we estimated the joint PMF for vowel and consonant co-occurrences within syllables, as given below.

Calculate the following:

  • Entropies H(X) and H(Y)
  • Joint entropy H(X, Y)
  • Conditional entropies H(X|Y) and H(Y|X)
  • Mutual information I(X; Y)
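Since the joint PMF table is not reproduced above, the sketch below uses a placeholder 3×3 joint distribution; all five quantities then follow from the standard identities H(X|Y) = H(X,Y) − H(Y), H(Y|X) = H(X,Y) − H(X), and I(X;Y) = H(X) + H(Y) − H(X,Y).

```python
import math

# Placeholder 3x3 joint PMF P(X=i, Y=j); the actual table from the
# question is not reproduced here, so these numbers are illustrative only.
p_xy = [[0.10, 0.05, 0.05],
        [0.05, 0.20, 0.05],
        [0.05, 0.10, 0.35]]

def H(probs):
    """Shannon entropy in bits; ignores zero-probability cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]        # marginal of X (row sums)
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y (column sums)

H_x, H_y = H(p_x), H(p_y)
H_xy = H(p for row in p_xy for p in row)
H_x_given_y = H_xy - H_y    # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_y_given_x = H_xy - H_x    # chain rule: H(Y|X) = H(X,Y) - H(X)
I_xy = H_x + H_y - H_xy     # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={H_x:.4f}  H(Y)={H_y:.4f}  H(X,Y)={H_xy:.4f}")
print(f"H(X|Y)={H_x_given_y:.4f}  H(Y|X)={H_y_given_x:.4f}  I(X;Y)={I_xy:.4f}")
```

With the actual table substituted for `p_xy`, the same identities give the answers to all four parts.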

Question 2

As a data scientist in a telecommunications company, your task is to analyse a customer dataset to predict whether a customer will terminate his/her contract. The dataset consists of around 8000 customer records, each consisting of one binary dependent variable, indicating whether the customer terminates the contract or not, and 19 independent variables, which include the customer’s information (e.g., age, subscription plan, extra data plan) and consumer behaviour such as the average number of calls and hours per week. Since your boss needs some actionable insights to retain customers, you decided to use interpretable machine learning methods. Design your interpretable machine learning method by answering the following questions:

You have implemented a feature selection algorithm based on mutual information to select the most informative features from the 19 independent variables. To validate the implementation of your mutual information calculation function, you use a small subset of the data to calculate mutual information manually. You select one independent variable, subscription plan, denoted as S, which takes two values. Please use the following Probability Mass Function table:

Calculate:

  • Entropy H(S)
  • Entropy H(Y)
  • Joint entropy H(S, Y)
  • Conditional entropy H(S|Y)
  • Conditional entropy H(Y|S)
  • Mutual information I(S; Y)
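The PMF table for S and Y is not shown above, so the check below uses a placeholder 2×2 joint distribution. It validates a mutual information implementation the way the question suggests: computing I(S;Y) directly from its definition and confirming it matches the entropy identity I(S;Y) = H(S) + H(Y) − H(S,Y).

```python
import math

# Placeholder 2x2 joint PMF P(S=s, Y=y); the actual table from the
# question is not reproduced here.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_s = {s: sum(v for (si, _), v in p.items() if si == s) for s in (0, 1)}
p_y = {y: sum(v for (_, yi), v in p.items() if yi == y) for y in (0, 1)}

def H(dist):
    """Shannon entropy in bits of a dict of probabilities."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

# Definition: I(S;Y) = sum_{s,y} p(s,y) * log2( p(s,y) / (p(s) p(y)) )
mi_def = sum(v * math.log2(v / (p_s[s] * p_y[y]))
             for (s, y), v in p.items() if v > 0)

# Identity: I(S;Y) = H(S) + H(Y) - H(S,Y)
mi_id = H(p_s) + H(p_y) - H(p)

print(f"I(S;Y) by definition: {mi_def:.4f}, by identity: {mi_id:.4f}")
```

If the two values agree (and match the manual calculation on the real table), the implementation is consistent.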

After applying your algorithm, you selected two variables:

  • Extra data plan, a binary random variable that indicates whether the customer subscribes to the extra data plan or not.
  • Averaged hours used per week, a continuous random variable.

You then built a logistic regression model to classify customers into low risk or high risk of terminating the contract. The fitted model is:

  1. Given a customer x who has the extra data plan and spent on average 0.5 hours per week, calculate the odds and the probability that the customer will terminate the contract.
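The fitted coefficients are not reproduced in the text, so the sketch below uses made-up parameters (b0, b1, b2 are assumptions, not the question's model); it shows the generic recipe: the linear predictor gives the log-odds, exponentiating gives the odds, and odds/(1+odds) gives the probability.

```python
import math

# Hypothetical fitted logistic regression: log-odds = b0 + b1*extra + b2*hours.
# These coefficient values are placeholders, NOT the model from the question.
b0, b1, b2 = -1.0, 0.8, 0.5

extra_plan, hours = 1, 0.5                     # the customer in the question
log_odds = b0 + b1 * extra_plan + b2 * hours   # linear predictor
odds = math.exp(log_odds)                      # odds of terminating
prob = odds / (1 + odds)                       # = 1 / (1 + e^{-log_odds})

print(f"log-odds={log_odds:.2f}  odds={odds:.4f}  P(terminate)={prob:.4f}")
```

Plugging the actual fitted coefficients into `b0, b1, b2` yields the answer for part 1.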

Question 3

As a machine learning expert for an AI cybersecurity company, your task is to design an automated network intrusion detection system. You have collected a large number of records of network activities. Each record includes log information about the network activity, such as protocol type, duration, and number of failed logins, which are modelled as random variables. Each record also includes a binary random variable Y, called the label, that was labelled by cybersecurity experts as intrusion or normal connection. Answer the following questions about Feature Selection Based on Mutual Information.

  • Explain to your colleague, who knows nothing about information theory, the concept of mutual information. Ans - Mutual information measures the amount of information that one random variable provides about another. It quantifies how much knowing one variable reduces uncertainty about the other. High mutual information indicates strong dependence between variables, while low mutual information suggests independence.
  • Explain the loop in the pseudocode of Table 1.

Ans - The two lines in the loop select features. In each iteration, we find the feature fmax that achieves the maximum mutual information with the label among all the remaining independent variables in the candidate set. However, some features may be highly correlated with each other, so selecting them would increase the number of features without improving the prediction. Therefore, we make sure that there is minimal redundancy between the candidate feature and the set of already selected features; that is exactly what the second term on the RHS achieves in line 5. We then add this feature to the selected set, remove it from the candidate set, and repeat until we have selected the desired number of features.
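The loop described above can be sketched as follows. Since the exact pseudocode of Table 1 is not reproduced here, this uses one common instantiation of the line-5 criterion (the mRMR score: relevance minus average redundancy with already-selected features); the function name and inputs are illustrative.

```python
def mrmr_select(mi_with_label, mi_between, k):
    """Greedy minimum-redundancy maximum-relevance feature selection.

    mi_with_label[i]   : I(X_i; Y), relevance of feature i to the label
    mi_between[i][j]   : I(X_i; X_j), redundancy between features i and j
    k                  : number of features to select
    """
    remaining = set(range(len(mi_with_label)))   # candidate set
    selected = []                                # selected set

    while remaining and len(selected) < k:
        def score(f):
            relevance = mi_with_label[f]
            # Second term of the line-5 criterion: average redundancy
            # between candidate f and the already-selected features.
            redundancy = (sum(mi_between[f][s] for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy

        f_max = max(remaining, key=score)   # feature maximising the criterion
        selected.append(f_max)              # add to selected set
        remaining.remove(f_max)             # remove from candidate set
    return selected

# Toy example: features 0 and 1 are highly redundant (I = 0.85), so after
# picking 0 the algorithm prefers the less-redundant feature 2.
mi_y  = [0.9, 0.8, 0.1]
mi_ff = [[0.0, 0.85, 0.05],
         [0.85, 0.0, 0.02],
         [0.05, 0.02, 0.0]]
print(mrmr_select(mi_y, mi_ff, 2))   # [0, 2]
```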

Question 4

A company wants to classify whether a customer will purchase a product (Y = Yes) or not (Y = No) based on categorical features. You have access to the following dataset:

Age Group   | Income Level | Prior Purchase | Purchase (Y)
------------|--------------|----------------|-------------
Young       | Low          | No             | No
Young       | Low          | Yes            | No
Young       | High         | No             | Yes
Middle-aged | Low          | No             | No
Middle-aged | High         | No             | Yes
Senior      | Low          | No             | No
Senior      | High         | Yes            | Yes
Senior      | High         | No             | Yes
  • Calculate the entropy of the target variable Y.
  • Compute the information gain for splitting on the feature “Prior Purchase.”
    • For Prior Purchase = Yes: [No, Yes]
    • For Prior Purchase = No: [No, Yes, No, Yes, No, Yes]
  • If you were to build a decision tree, which feature would be the best root node? Justify your answer. The best root node should have the highest information gain. “Income Level” splits the data perfectly (all Low rows are No and all High rows are Yes), so it has the maximal IG and should be selected as the root node.
  • Discuss how overfitting can be avoided in decision trees and suggest techniques to improve generalization.
    • Pruning: Remove branches that add little classification value. Use a validation set to decide how much to prune.
    • Limiting depth: Restrict the maximum depth of the tree. Use a validation set to find when to stop growing the tree.
    • Applying ensemble methods: Use bagging or boosting.
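The entropy and information-gain answers above can be checked numerically. The script below encodes the dataset from the table and computes H(Y) and the information gain of each feature, using only the standard library; `rows` and the helper names are just this sketch's choices.

```python
import math
from collections import Counter

# Dataset from the question: (Age Group, Income Level, Prior Purchase, Y)
rows = [
    ("Young", "Low", "No", "No"),
    ("Young", "Low", "Yes", "No"),
    ("Young", "High", "No", "Yes"),
    ("Middle-aged", "Low", "No", "No"),
    ("Middle-aged", "High", "No", "Yes"),
    ("Senior", "Low", "No", "No"),
    ("Senior", "High", "Yes", "Yes"),
    ("Senior", "High", "No", "Yes"),
]

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature_idx):
    """IG = H(Y) minus the weighted entropy of each split subset."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    n = len(rows)
    for value in set(r[feature_idx] for r in rows):
        subset = [r[-1] for r in rows if r[feature_idx] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

print("H(Y) =", entropy([r[-1] for r in rows]))      # 1.0 (4 Yes, 4 No)
for name, idx in [("Age Group", 0), ("Income Level", 1), ("Prior Purchase", 2)]:
    print(f"IG({name}) = {info_gain(rows, idx):.4f}")
```

The output confirms the answers above: IG(Income Level) = 1 bit (a perfect split), IG(Prior Purchase) = 0 (both branches are a 50/50 label mix), and IG(Age Group) is small, so Income Level is the best root node.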