Business Analytics – Third Assignment
Note: Submit your answers in a Word document, transferring the results from the Excel file into it. You must also submit your Excel file (preferably a single Excel file with one or more worksheets for each question), but note that only the Word document will be marked. If you think any question is unclear or contains an issue, state your assumptions (if any) and explain them clearly in your report.
Add the coversheet and sign it. Write the names of your tutor and your lecturer on the coversheet.
The analyses and the answers must be your own individual work, completed without consulting any other person. You are also not allowed to help or advise other students.
- Open worksheet “Q1” of the Excel file (provided for this assignment) and develop the following visualizations: (3+3+3=9 marks)
- A figure showing whether there is a relationship between age and income.
- A figure showing distribution of income.
- A figure showing the number of people registered by registration date.
- What is an outlier in a dataset, how can outliers be detected, and how can they be dealt with? (6 marks)
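As one illustration of outlier detection, the interquartile-range (IQR) rule flags values that lie more than 1.5 IQRs below the first quartile or above the third quartile. The sketch below is only a minimal example of that rule: the function name `iqr_outliers` and the income values are hypothetical and are not taken from the assignment workbook. Other detection methods (z-scores, box plots, scatter charts) are equally valid.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical incomes (in thousands); illustrative data only.
incomes = [42, 45, 47, 50, 52, 55, 58, 60, 250]
flagged = iqr_outliers(incomes)
print(flagged)
```

Whether a flagged value should be removed, corrected, capped, or kept depends on whether it is a data-entry error or a genuine extreme observation.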
- Assume there are two explanatory variables (X1 and X2) in a logistic regression model.
- X1 is a categorical variable with levels including very low, low, average, high and very high
- X2 is a categorical variable with levels including Sydney, Melbourne, Hobart and Brisbane.
Explain how you will use these variables in developing a logistic regression model. How many coefficients will you have in the final model? (5 marks).
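One common way to use categorical variables in a regression model is dummy (indicator) coding, where a variable with k levels becomes k - 1 binary columns and the omitted level acts as the reference. The sketch below assumes that approach, with no interaction terms. The helper `dummy_code` and the choice of the first level as the reference are illustrative assumptions. Because X1 is ordered, coding it as a single numeric score is an alternative that would give a different coefficient count.

```python
def dummy_code(value, levels):
    """Encode one categorical value as len(levels) - 1 indicator variables.

    The first entry of `levels` is treated as the reference level and is
    encoded as all zeros, which avoids collinearity with the intercept.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

x1_levels = ["very low", "low", "average", "high", "very high"]
x2_levels = ["Sydney", "Melbourne", "Hobart", "Brisbane"]

# One observation with X1 = "high" and X2 = "Sydney" (the reference city).
row = dummy_code("high", x1_levels) + dummy_code("Sydney", x2_levels)

# One intercept plus one slope per dummy column (no interactions assumed).
n_coefficients = 1 + len(row)
print(row, n_coefficients)
```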
- The data in worksheet “Q4” are the results of a 4-year study conducted to assess how age, weight, and gender influence the risk of diabetes. Risk is interpreted as the probability (times 100) that the patient will have diabetes over the next 4-year period. What predictive model would you suggest for relating the risk of diabetes to a person’s age, weight, and gender? Why? (5 marks)
- Worksheet “Q5” of the Excel file (provided for this assignment) contains 500 records of clients who have purchased special products from an e-business website. Each record includes the types of product purchased (between 1 and 5), the purchase amount ($), the age, gender, and family size of the customer, whether the customer has a membership, and whether the customer has a discount card. (8+7=15 marks)
- Develop a multiple regression model to predict the purchase amount from the other variables in the data set. Write the final equation, interpret the coefficients of the model, and explain the accuracy of the model.
- What would you recommend to improve the accuracy of this model? What would you recommend to make the model simpler?
- The following screenshot is taken from the logistic regression output for the data set “credit card”. You can find the data set here. The response variable, called “card”, is a binary variable that counts as a success (yes or 1) if the customer’s application for a credit card is accepted. (8+7=15 marks)
- Write the logistic regression equation based on the output.
- In Excel worksheet “Q6” you can find the actual values of the dependent variable alongside the predicted values. Calculate the overall error, sensitivity, and specificity. Explain the steps of your calculations.
- Consider the following confusion matrix, the result of a logistic regression model that classifies a dependent variable y (whether or not the patient is having a heart attack) based on a given set of independent variables. In this model, y = 1 indicates a heart attack and y = 0 indicates no heart attack. The cut-off value is 50%. Do you think the cut-off value needs to be changed? Justify your answer. (5 marks)
Actual class 1    2700    1000
Actual class 0      70    3068
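The overall error, sensitivity, and specificity can be computed from a confusion matrix as sketched below. The column headers of the matrix above are not shown, so this sketch assumes the first column is "predicted 1" and the second is "predicted 0". If the columns are in the opposite order, the false negative and false positive roles swap with the true positive and true negative counts. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute basic classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        # Share of all cases that were misclassified.
        "overall_error": (fn + fp) / total,
        # Share of actual positives (y = 1) correctly predicted as 1.
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives (y = 0) correctly predicted as 0.
        "specificity": tn / (tn + fp),
    }

# ASSUMPTION: columns are (predicted 1, predicted 0), rows are (actual 1, actual 0).
metrics = classification_metrics(tp=2700, fn=1000, fp=70, tn=3068)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumed layout, comparing sensitivity with specificity shows which type of error the 50% cut-off produces more often, which is the starting point for arguing whether the cut-off should move.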
- Worksheet “Q8” presents a dataset from a blood bank. The data record the blood donations made by a group of donors over a period of time. The donor ID is unique for each donor, and a donor might have donated more than once in this period. At each donation, the donor’s blood total protein level was recorded. Some values for blood type are missing. How can you fill in the missing values? Explain your approach step by step, then apply it to fill in as many of the missing values as possible. (Save the results in an Excel worksheet and name it Question 8.) (5 marks)
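Because a donor's blood type does not change between donations, one plausible approach is to copy the blood type from another record with the same donor ID. The sketch below illustrates that idea only. The field names `donor_id` and `blood_type`, the function `fill_blood_types`, and the sample records are all hypothetical and do not come from worksheet “Q8”. Records are left blank when the donor has no known type or has conflicting types.

```python
def fill_blood_types(records):
    """Fill missing blood types using other records of the same donor."""
    # Step 1: collect every known blood type per donor ID.
    known = {}
    for rec in records:
        if rec["blood_type"]:
            known.setdefault(rec["donor_id"], set()).add(rec["blood_type"])

    # Step 2: fill a gap only when the donor has exactly one known type.
    filled = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input list is not modified
        types = known.get(rec["donor_id"], set())
        if not new_rec["blood_type"] and len(types) == 1:
            new_rec["blood_type"] = next(iter(types))
        filled.append(new_rec)
    return filled

# Hypothetical donation records; None marks a missing blood type.
donations = [
    {"donor_id": 101, "blood_type": "A+"},
    {"donor_id": 101, "blood_type": None},
    {"donor_id": 102, "blood_type": None},
    {"donor_id": 103, "blood_type": "O-"},
]
result = fill_blood_types(donations)
print([rec["blood_type"] for rec in result])
```

In Excel, the same lookup could be done with a sorted copy of the data or a lookup formula keyed on the donor ID.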
- Open the Excel worksheet “Q9”. This data set contains 12 observations, in which x and y represent the independent and dependent variables, respectively. Develop a regression model which best represents the data set. You can develop a linear or a polynomial regression model, but the polynomial cannot exceed degree 3. In other words, your options are limited to degree 1, 2, and 3 regression models. (5+5+5=15 marks)
- Develop a scatter chart and explain, based on the chart, what kind of regression model is potentially suitable.
- Develop potential regression models and choose the best one. Write the regression equation which you think best represents your data set. Explain why the model you picked is the best one.
- Develop residual plots for all of the regression models and explain whether they help you choose the appropriate regression model.
- Assume that you are a business analyst in a manufacturing company. Your manager has given you the task of optimising the scheduling of a project, and you are asked to minimise the overall manufacturing time. There are three items to be manufactured and three different machines available for the task. For operational reasons, every machine can be allocated to the manufacturing of only one item, and every item can be manufactured by only one machine. Different machines manufacture different items at different speeds. Let us say represent item and machinery, respectively. represents the first item represents the second item and so on. represents the first machine, represents the second machinery and so on. The duration of manufacturing every item on every machine is presented in the table below. (10+10=20 marks)
In this problem the sequence of manufacturing is important. Items and should be manufactured before manufacturing item .
- Write the linear optimisation model for the company to make the best decision.
- Solve the model, present the results and interpret them.
Hint: you can use a binary variable such as which can take values of zero and one. Let us say if machine is engaged to manufacture item otherwise .
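The one-item-per-machine structure described above is an assignment problem, and for three items and three machines it can be checked by enumerating all six possible assignments. The sketch below does that as a sanity check on a solver result. The duration values are hypothetical because the duration table is not reproduced here, the function name `best_assignment` is illustrative, and the sequencing requirement is deliberately not modelled.

```python
from itertools import permutations

# HYPOTHETICAL durations: durations[i][j] = time for machine j to make item i.
durations = [
    [4, 6, 5],
    [7, 3, 8],
    [6, 5, 2],
]

def best_assignment(durations):
    """Return (total time, assignment), where assignment[i] is item i's machine."""
    n = len(durations)
    best_cost, best_perm = None, None
    # Each permutation assigns every item to a distinct machine, so both
    # "one item per machine" and "one machine per item" hold by construction.
    for perm in permutations(range(n)):
        cost = sum(durations[i][perm[i]] for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

total, assignment = best_assignment(durations)
print(total, assignment)
```

Each permutation corresponds to one feasible setting of the binary variables in the hint: the variable for a machine and item pair is one exactly when the permutation pairs them.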
TOTAL MARKS = 100
The 100 marks will be converted to a 40% weighting for the course.