ETC3250/5250 - Introduction to Machine Learning Assignment

Assignment Task

Instructions

  • This is an individual assignment. While you can discuss the questions with others for the purposes of enhancing your learning, your submission must be your own work. Identical responses are not allowed and will get zero marks. Please note the Monash Student Academic Integrity Procedure.
  • A skeleton file is provided for you to complete and turn in.
  • You must submit your R Markdown file, its output and the data file to Moodle. That is, three files in a zip file need to be submitted. No other formats will be accepted.
  • This assignment is meant to be reproducible and your answers will be checked against our template answer, so make sure you answer the questions in order. It is expected that knitting the R Markdown file will produce the HTML file submitted. If the R Markdown file does not knit, then your score will be reduced.
  • Original work is expected. Any material used from external sources needs to be acknowledged.

Data

This assignment is based on the data from the Kaggle competition to predict restaurant revenue. Each observation corresponds to a restaurant and the variable that you want to predict is the restaurant's revenue.

Each of you will be provided with a unique dataset based on this data (note: it is not exactly the same as the original). Download this data from this app. You should have 2,000 observations and 12 variables in your data. Make sure you get the correct data by entering your student ID into the app – mark penalties will apply for using the wrong data.

The data contain the following variables.

  • Restaurant ID.
  • Opening date of the restaurant.
  • Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile.
  • Obfuscated variables that are either demographic, real estate or commercial information related to the restaurant.
  • The revenue column indicates a (transformed) revenue of the restaurant in a given year. Please note that the values are transformed, so they do not represent real dollar values.

Task

With over quick service restaurants across the globe, TFI is the company behind some of the world’s most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees.

Your challenge is now to investigate models to predict the revenue and to recommend to TFI which model to use for future predictions. Your supervisor has given you the following set of questions to guide your investigation.

A. Preliminary analysis

  1. Load your downloaded data into R as a object, ensuring each variable is encoded appropriately, and display the first 10 rows of your data set.  

  2. Construct a new variable called  that corresponds to the age of the restaurant (in years) at 1st January, 2015. Show the histogram of this  variable.

  3. Produce a pairwise scatter plot of each numerical variable against the response. What do you notice from the plot? Make another plot of each numerical variable against the response that better shows the relationship between the two variables.

  4. Produce a numerical summary of all the variables in the data set.  

  5. Using the preliminary exploration in questions 1 to 4, do you observe any patterns in the data? Should you use the variable  and  in your predictive model? Explain your answer. 
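
For question A2, the age calculation can be sketched as follows. The sketch is in Python purely to illustrate the arithmetic (the assignment itself is to be completed in R). The 365.25-day year, the function name `age_in_years` and the example opening date are assumptions made for illustration, not part of the assignment.

```python
from datetime import date

# Reference date stated in question A2.
REFERENCE_DATE = date(2015, 1, 1)


def age_in_years(opening: date, reference: date = REFERENCE_DATE) -> float:
    """Fractional age in years, assuming an average year of 365.25 days."""
    return (reference - opening).days / 365.25


# Hypothetical opening date, used only to show the call.
example_opening = date(2010, 1, 1)
print(round(age_in_years(example_opening), 2))
```

Whether to keep fractional years or round down to whole years is a modelling choice; the sketch keeps the fraction so that no information is discarded.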

B. Regression

  1. Remove the variables  and   from the data and select 70% of the observations to be used as the training data and the remaining data as the testing data.  

  2. Use the training sample to estimate a multiple linear regression model for the response in terms of all the predictors. Show the summary of this model fit. Discuss how well this model fits the data.

  3. Consider a model for  with all predictors  except   Show the summary of this model fit. Compare this model with the fitted model in question B2 using a hypothesis test. Explain the results of this test. 

  4. Which of the two regression models considered (in questions B2 and B3) is best at predicting new records? Explain your answer.  
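
The 70/30 split in question B1 and the comparison of predictions in question B4 can be sketched as below. This is an illustrative Python sketch only (the assignment is to be completed in R). The function names, the seed value and the choice of root mean square error as the test-set measure are assumptions.

```python
import math
import random


def train_test_split(n_rows, train_fraction=0.7, seed=2021):
    """Randomly assign row indices to a training set and a testing set.

    The seed is an arbitrary illustrative value; fixing it makes the
    split reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


train_idx, test_idx = train_test_split(2000)
print(len(train_idx), len(test_idx))
```

Fitting each candidate model on the training rows and evaluating an error measure such as `rmse` on the testing rows gives one basis for judging which model predicts new records better.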

C. Subset selection

For this question use the training data from question B1.

  1. Consider the model in question B2 as the full model. Perform a backward elimination using BIC. Report the final selected model using this process.  

  2. Again consider the model from question B2 and now perform a stepwise regression using AIC. How does the selected model differ from the one selected in question C1?
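
To see why the two criteria in questions C1 and C2 can select different models, it may help to compare their penalties. For a Gaussian linear model, both criteria can be written (up to an additive constant) as n log(RSS/n) plus a penalty per estimated parameter: 2 for AIC and log(n) for BIC. The Python sketch below illustrates this; the function names and the numbers passed in are made up for illustration.

```python
import math


def information_criterion(rss, n_obs, n_params, penalty):
    """n * log(RSS / n) + penalty * k, ignoring an additive constant."""
    return n_obs * math.log(rss / n_obs) + penalty * n_params


def aic(rss, n_obs, n_params):
    """AIC charges 2 per estimated parameter."""
    return information_criterion(rss, n_obs, n_params, 2.0)


def bic(rss, n_obs, n_params):
    """BIC charges log(n) per estimated parameter."""
    return information_criterion(rss, n_obs, n_params, math.log(n_obs))


# Hypothetical values: 1,400 training rows, 5 parameters, RSS of 100.
n_obs, n_params, rss = 1400, 5, 100.0
print(aic(rss, n_obs, n_params), bic(rss, n_obs, n_params))
```

Because log(n) exceeds 2 once n is larger than about 7, BIC penalises each extra parameter more heavily than AIC for a training set of this size, so it tends to favour smaller models.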

D. Regularization

  1. Make an appropriate transformation to the training data for regularization methods. From this transformed training dataset, create a 5-fold cross-validation dataset.

  2. Using the dataset from question D1, select the optimal tuning parameter \(\lambda\) for lasso regression using the average root mean square error. You can for example use the search range for \(\lambda\) to be \([1, 10^{10}]\) (or in code use ) but you may need to vary this range depending on your data. You should  not  use any convenience function like to select \(\lambda\).  

  3. Fit an elastic net model with optimal \(\lambda\) and \(\alpha\) selected by cross validation root mean square error using the dataset from question D2. Recall that \(\alpha \in [0, 1]\) (in code you can use . Remember that you need to find a combination of \(\lambda\) and \(\alpha\) that minimises the average root mean square error. Again don’t use any convenience (i.e. one line) function. Discuss the results.  
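
The manual search described in questions D2 and D3 can be outlined as a grid search over folds. The Python sketch below shows only the search logic (the assignment is to be completed in R). The function names are assumptions, and `fold_rmse` is a hypothetical callback standing in for fitting the penalised model on the training folds and scoring it on the held-out fold. It also assumes the common convention in which \(\alpha = 1\) gives the lasso.

```python
import itertools
import random
import statistics


def log_spaced_grid(low_exp, high_exp, n_points):
    """n_points values 10**e with e evenly spaced in [low_exp, high_exp]."""
    step = (high_exp - low_exp) / (n_points - 1)
    return [10 ** (low_exp + i * step) for i in range(n_points)]


def k_fold_indices(n_rows, k=5, seed=2021):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(lambdas, alphas, folds, fold_rmse):
    """Return (mean RMSE, lambda, alpha) with the lowest average RMSE.

    fold_rmse(lam, alpha, train_idx, valid_idx) is a hypothetical callback:
    it should fit the model on train_idx and return the RMSE on valid_idx.
    """
    best = None
    for lam, alpha in itertools.product(lambdas, alphas):
        scores = []
        for i, valid_idx in enumerate(folds):
            # All rows outside the held-out fold form the training rows.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fold_rmse(lam, alpha, train_idx, valid_idx))
        mean_rmse = statistics.mean(scores)
        if best is None or mean_rmse < best[0]:
            best = (mean_rmse, lam, alpha)
    return best


lambdas = log_spaced_grid(0, 10, 11)    # covers [1, 10^10]
alphas = [i / 10 for i in range(11)]    # covers [0, 1]
folds = k_fold_indices(1400, k=5)
```

For the lasso search in question D2, the same function could be called with `alphas` fixed to `[1.0]`; for the elastic net in question D3, the full `alphas` grid is searched jointly with `lambdas`.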

E. Conclusion

Answer the following based on the results and the models you considered in questions A-D.

  1. What variables are important (or not important) in modelling the response? Explain your answer. 

  2. What model would you recommend to TFI for predicting new records? Give statistical reasons for your recommendation. 
