MS987-Optimization for Analytics - IT Computer Science Assignment Help

Assignment Task


Task 

You are recruited by a major management consultancy firm, which has recently set up a unit for data science consultancy that develops data-driven predictive analytics and decision-making tools for a broad range of clients, from the healthcare industry and governmental organisations to airlines and logistics providers. The data science consultancy unit not only provides analysis or solutions for a specific problem to satisfy a client’s current needs, but also carries out in-house research with partners from universities and research labs to investigate opportunities that may give it the competitive advantage of first entry into the market. The unit already houses a good number of technical experts, including data scientists, optimization experts and statisticians, as well as management consultants and a sales team, and has significant plans for growth over the next year.

The unit has been getting increasingly involved with projects that require optimization expertise, and although they are in the process of recruiting more optimization experts in-house, they have a number of urgent projects with due dates approaching and hence they need your expertise in this area. The two specific projects with which they require your help are: 1) an in-house project to efficiently diagnose specific cancer patients, and 2) a consultancy project for an external client to help them make various operational and strategic decisions. These two projects are described in detail in the following two parts, including background and expectations for each part.


PART 1: Helping Diagnosis for Cancer Patients

A range of medical methods are currently available for cancer diagnosis. These methods vary not only from simpler and cheaper ones to those requiring more time or resources, but also in their effectiveness and between different types of cancers. Due to these variations as well as a specific patient’s needs, one method may be preferable to others (or to be used in combination with another method to be conclusive), in order not to delay the diagnosis – whether it is “benign” and hence releasing the stress on the patient, or “malignant” and hence urging the start of the necessary treatment as soon as possible.

Debasis Krishnan and Sarah Lu, two data scientists with optimization expertise, are currently working on a specific diagnostic procedure called “Fine-needle aspiration” (FNA), which is a type of biopsy. In this procedure, a thin, hollow needle is inserted into a mass (such as a lump) for sampling of cells, and the sample of cells is examined under a microscope. Using this technique, digitized images are used to compute a range of attributes of the cell nuclei present, such as average/variation in radius, texture, area and compactness.


Sarah and Debasis obtained some real data on the use of FNA, as can be seen in the data file called “PatientData.txt” (see “Data File for Part 1” on MyPlace). This file contains data for 569 patients, where each row of the data refers to a single patient, and each row contains 32 entries: the first entry is a patient identifier number (anonymized), the second entry is the actual diagnosis (B = benign or M = malignant), and the remaining 30 entries are real-valued measurements. For each patient, 3 cell nuclei at 3 different depths are sampled, and the following ten FNA attributes are measured for each sampled cell nucleus:
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave points
9. symmetry
10. fractal dimension
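As a rough illustration of the row layout described above, the sketch below splits one row into a diagnosis label and its 30 measurements. It assumes comma-separated fields, which the text does not confirm; the function name `parse_patient_row` and the sample row are hypothetical.

```python
def parse_patient_row(line, sep=","):
    """Split one data row into (diagnosis, measurements).

    Assumed layout: identifier, diagnosis (B or M), then 30 real values,
    separated by `sep`. The delimiter is an assumption of this sketch.
    """
    fields = [f.strip() for f in line.strip().split(sep)]
    if len(fields) != 32:
        raise ValueError(f"expected 32 entries, got {len(fields)}")
    diagnosis = fields[1]
    if diagnosis not in ("B", "M"):
        raise ValueError(f"unknown diagnosis {diagnosis!r}")
    measurements = [float(v) for v in fields[2:]]  # identifier is dropped
    return diagnosis, measurements


# Made-up row for illustration only (not taken from PatientData.txt).
sample = "1001,M," + ",".join(str(0.5 * k) for k in range(30))
label, xs = parse_patient_row(sample)
```

The length check makes a wrong delimiter fail loudly instead of silently producing short vectors.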

Sarah and Debasis would like to be able to compute a linear function defined over the 30 measurements provided, which can then be used to classify each patient as correctly as possible. They already worked out some technicalities, in particular that what they are trying to achieve is a so-called “separating plane” (say, f(x) = wx − γ = 0, or equivalently wx = γ, where w and γ are the decision variables, and x indicates the vector of 30 measurements) and they can use linear programming for this purpose. The idea of a separating plane is that you identify a line which separates one class of data (say, benign) from the other (say, malignant), as much as possible. For example, see a very simple example with only 2 attributes (x- and y-axis) and 8 patients (4 in each class, blue for M and red for B) in the figure on the left: here, both the black and red lines show feasible separating planes that classify the data perfectly, since all blue (M) points satisfy f(x) > 0 (as they are above the line) and all red (B) points satisfy f(x) ≤ 0 (as they are below the line). Of course, this simplistic example highlights two important issues (in addition to being only two-dimensional and hence not covering all 30 measurements), which we discuss next:

1) What if there was a blue dot in the middle of red dots? Of course in this case neither of the above two lines (nor any other line you can come up with yourself) would have been a “perfect” separating plane. However, remember our aim is not to find a perfect one, but rather the best we could get and either of the two lines above would have still worked for that purpose, with one misclassified data point (i.e., the new dot just added to the graph) and all other eight points classified correctly. Our objective would be to minimize such misclassified points (more on this later).

2) What if there are multiple separating planes? As we already see in this simplistic example with two easily separable sets of data, there is indeed an infinite number of separating planes you can come up with. Although any of these planes would have worked to classify the data, considering we have limited data on hand, it is wise to pick the plane that is “farthest” from both sets of points (or “most central” in between two data sets), since this is more likely to reduce misclassifications in the future as it will increase the separation of the two sets (more on this later).

Before addressing these issues, we will define some notation and explain some concepts. First of all, recall the definition of the separating function:

$$f(x) = wx - \gamma$$

where the data vector x has 30 dimensions (as there are 30 measurements for each patient), and hence decision variable vector w will also have 30 dimensions (whereas γ is only a scalar, i.e., it has only one dimension). Note that we can rewrite this expression as follows:

$$f(x) = \sum_{i=1}^{30} w_i x_i - \gamma$$

Let B and M indicate the set of patients with benign and malignant diagnosis, respectively. For the data vector x of any patient in M, we would like to have f(x) = wx − γ > 0, whereas for the vector x of any patient in B, we would like to have f(x) = wx − γ ≤ 0 (but it is likely these conditions will not be satisfied for some of the patients – hence misclassifications). Sarah and Debasis worked out that to address the first issue on hand, they need to measure an “error” only if a given point is on the wrong side of the line (which should be 0 otherwise), and minimize the sum of all these errors. Working out the maths in order to write this as an LP, they note that:

1) For the data vector x of any patient in M, we will have a constraint in the form where y is the nonnegative variable measuring the error for this specific patient;

2) For the data vector x of any patient in B, we will have a constraint in the form where z is the nonnegative variable measuring the error for this specific patient;

3) The objective function will be the minimization of the summation of all error variables over all patients.
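One common way to write such an LP is sketched below; it is not necessarily the exact form Sarah and Debasis have in mind. The scalar threshold of the separating plane is written $\gamma$, the superscript $k$ indexing patients is notation introduced here, and the right-hand side of 1 for malignant patients is an assumption standing in for the strict inequality f(x) > 0, which an LP cannot express directly.

```latex
\begin{aligned}
\min_{w,\gamma,y,z}\quad & \sum_{k\in M} y_k + \sum_{k\in B} z_k \\
\text{s.t.}\quad & w x^{k} - \gamma + y_k \ge 1, && k \in M,\\
& w x^{k} - \gamma - z_k \le 0, && k \in B,\\
& y_k \ge 0,\quad z_k \ge 0, && w,\ \gamma \ \text{free.}
\end{aligned}
```

Some positive right-hand side is needed: if both constraints had 0 on the right, setting w = 0 and γ = 0 would give zero total error without separating anything. Under this particular sketch, a malignant patient with 0 < y_k < 1 lies inside the margin but still on the correct side of the plane.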

To address the second issue, Sarah and Debasis realized that the objective function can be updated simply by adding |w| (i.e. the absolute value of vector w, or $\sum_{i=1}^{30} |w_i|$ in mathematical terms) to the objective function to be minimized. Of course they are aware that absolute value is not a linear function, but they believe this can be linearized easily by tweaking the LP stated above slightly. (Hint: You might recall the linear regression problem from the class and our modelling approach there.)
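A minimal sketch of one standard linearisation, assuming |w| means the sum of the absolute values of the components of w. The auxiliary variables $s_i$ are names chosen here for illustration: each one is bounded from below by both $w_i$ and $-w_i$.

```latex
\begin{aligned}
\min\quad & (\text{sum of all error variables}) + \sum_{i=1}^{30} s_i \\
\text{s.t.}\quad & s_i \ge w_i, \qquad s_i \ge -w_i, && i = 1,\dots,30,\\
& \text{the classification constraints of the LP, unchanged.}
\end{aligned}
```

Because each $s_i$ appears in the objective with a positive coefficient, it settles at $\max(w_i, -w_i) = |w_i|$ in any optimal solution, so the model stays linear.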


Sarah and Debasis expect the following actions for this project:
a) You should first read the data for the first 500 patients from the input file using FICO Xpress Mosel. When doing so, ensure you record the diagnosis for each patient, as well as the measurements for 30 attributes (you do not need patient identifiers). 

b) Using the data, build an LP model as described above in FICO Xpress Mosel, where you ensure the data for each patient is used in the correct set of constraints, and solve it to find the best values for w and γ. Comment throughout your file where appropriate. 

c) Ensure your FICO Xpress model outputs into an output file the values of w and γ, as well as optimal error values for each patient (preferably only for misclassified patients, that is, only error values that are not zero).

d) Write a memo to Sarah and Debasis (at most 2 pages with standard margins, 11pt Arial or Calibri, 1.5 line spacing), briefly explaining key aspects of your FICO Xpress model (such as sets you built, how you structured constraints correctly, etc.) and any key messages regarding model outputs (in particular, whether any of the original 10 attributes are better predictors than others, independent of which of the cell nuclei they are measured on). It is perfectly fine to refer to your FICO Xpress model or output file when writing your memo. (Note: A memo is an ‘informal letter’ written to someone to give key messages; also remember Sarah and Debasis are technical experts so you can certainly use technical language if you want.) (8 marks)

e) If time permits, Sarah and Debasis would like you to test your function on the remaining 69 patients from the input data. This will require you to read their data and then plug in the optimal values of w and γ to calculate error values, which should also be written into the output file. Note you will not re-solve the LP. If you decide to do this part, add a sentence to your report from part d), noting how many patients are misclassified.
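The test in part e) amounts to evaluating the fitted function on unseen rows. The sketch below shows that evaluation in Python on invented two-attribute data; the names `classify` and `count_misclassified` are hypothetical, and the rule "M if f(x) > 0, otherwise B" follows the definition of the separating plane, with the threshold passed as `gamma`.

```python
def classify(w, gamma, x):
    """Return 'M' if f(x) = w.x - gamma > 0, otherwise 'B'."""
    f = sum(wi * xi for wi, xi in zip(w, x)) - gamma
    return "M" if f > 0 else "B"


def count_misclassified(w, gamma, patients):
    """patients: list of (diagnosis, measurements) pairs."""
    return sum(1 for diag, x in patients if classify(w, gamma, x) != diag)


# Toy 2-attribute data, invented for illustration (the real data has 30).
w, gamma = [1.0, 1.0], 3.0
held_out = [("M", [2.0, 2.0]), ("B", [1.0, 1.0]), ("M", [1.0, 1.5])]
errors = count_misclassified(w, gamma, held_out)
```

Here the third patient is malignant but has f(x) = −0.5, so it lands on the benign side and is counted as the single misclassification.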


PART 2: Improving Operations and Strategic Decision Making

An important client of the data science consultancy unit is a software company, which produces solutions to a range of organisations, from a number of city councils in Scotland to rail companies and airlines, in order to develop analytical solution methods for a number of challenging decision making problems stemming from transportation related daily operations and strategic decision making. A team led by Shona McInnes and Dimitrios Robertson require your help by addressing some of these challenges, including enabling them to build some generic tools they will likely employ in upcoming projects from their broad range of clientele. While Shona and Dimitrios have quantitative backgrounds (mathematics and computer engineering, respectively), their knowledge of optimization is a bit limited, and hence, your guidance is also essential in that respect.

Shona and Dimitrios have recently obtained a data file for a setting of a home care provider in Glasgow (a comma-delimited file named “Distances.csv”), which details walking distances (in seconds) between 233 different locations, each location identified with a UserID. Essentially, home care services are provided to elderly, disabled or recovering/sick citizens in the city limits, which range from providing medication and preparing their meal to lifting them from their bed for some basic activities. Although this data is specific to home care services in Glasgow, they believe that they may develop generic tools using this data to build customized solutions for their client. As you can observe, this data file is simply a matrix of distances, where each row and each column provide the information for the associated UserID’s (and the value in the matrix simply indicates the walking distance in seconds.) Also note that some of the entries in the matrix are zero, which means there is no information on these distances (or the UserID’s of the associated row and column are the same) and hence they assume that there is no direct connection for these links.
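A possible way to load such a matrix in Python is sketched below. The exact layout of “Distances.csv” is an assumption here: a header row listing the UserIDs after an empty corner cell, then one row per UserID. Zero entries are skipped so that they are treated as "no direct connection". The function name and the tiny sample are hypothetical; only the 2110-second link between UserIDs 71 and 142 is a value quoted in this brief.

```python
import csv
import io


def read_distance_matrix(handle):
    """Return {(from_id, to_id): seconds} for links with a known distance.

    Assumed layout: header row of UserIDs after an empty corner cell,
    then one row per UserID. Zero entries are skipped (no connection).
    """
    reader = csv.reader(handle)
    header = next(reader)
    col_ids = [int(c) for c in header[1:]]
    arcs = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        row_id = int(row[0])
        for col_id, cell in zip(col_ids, row[1:]):
            seconds = float(cell)
            if seconds > 0:
                arcs[(row_id, col_id)] = seconds
    return arcs


# Tiny invented matrix standing in for the real file.
sample = io.StringIO(",71,142,280\n71,0,2110,0\n142,2110,0,300\n280,0,300,0\n")
arcs = read_distance_matrix(sample)
```

For the real file, the same function could be called on `open("Distances.csv", newline="")` if the layout assumption holds.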

Based on a number of recent or current projects they have been involved in, there are a number of particular scenarios of interest to Shona and Dimitrios. First of all, they had a recent project, where a similar network was considered with a number of centres (or ‘depots’ in the network optimization language), that is, the locations where the home care staff are based, and the rest of the locations in the network representing the locations of the homes of the citizens needing the home care services (or ‘clients’). In such a scenario, a question of interest was to determine the sum of all shortest distances from each client to any of the depots (note that it is not the shortest distance to each of the depots, but rather only the shortest distance to the nearest depot.) Shona and Dimitrios, based on their limited optimization knowledge, believe this can be simply modelled as a network flow problem – not sure how though, and they certainly need your help building this. 
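The quantity described here can be cross-checked outside an optimization model. The sketch below is not the network flow model Shona and Dimitrios ask for; it swaps in a multi-source Dijkstra search, run over reversed arcs from all depots at once, to get each node's distance to its nearest depot. The function name and the toy network are invented for illustration, and node identifiers are assumed to be integers.

```python
import heapq


def nearest_depot_distances(arcs, depots):
    """Distance from every reachable node to its nearest depot.

    arcs: {(u, v): cost} for directed links u -> v with cost > 0.
    Searching backwards from all depots at once means the value for
    node u is the cheapest u -> depot walk.
    """
    incoming = {}
    for (u, v), cost in arcs.items():
        incoming.setdefault(v, []).append((u, cost))
    dist = {d: 0.0 for d in depots}
    heap = [(0.0, d) for d in depots]
    heapq.heapify(heap)
    while heap:
        d_v, v = heapq.heappop(heap)
        if d_v > dist.get(v, float("inf")):
            continue  # stale queue entry
        for u, cost in incoming.get(v, []):
            if d_v + cost < dist.get(u, float("inf")):
                dist[u] = d_v + cost
                heapq.heappush(heap, (dist[u], u))
    return dist


# Invented network: depots 100 and 200, clients 1, 2 and 3.
toy = {}
for u, v, c in [(100, 1, 4), (1, 2, 3), (2, 200, 2), (200, 3, 6), (1, 3, 10)]:
    toy[(u, v)] = toy[(v, u)] = float(c)  # symmetric walking times
dist = nearest_depot_distances(toy, [100, 200])
total = sum(d for node, d in dist.items() if node not in (100, 200))
```

In the toy network, client 1 is nearest to depot 100 (4), while clients 2 and 3 are nearest to depot 200 (2 and 6), so the total is 12. A figure like this can serve as a sanity check on the objective value of a flow model.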

There is also a separate and upcoming project extending from this earlier project, where they will also need to make various strategic decisions (but more details on this later in part d.) In a current and more challenging project regarding daily operations, they faced the issue of deciding how to route each home care staff member. In this project, the duration of each task (or ‘visit’) is safely assumed to be 15 minutes. Essentially, each staff member will arrive at one of the centres in the morning, and then start their 8-hour shift by walking to their first client, carry out the necessary task in this first visit, then walk to their second client, carry out the necessary task in this second visit, and so on, until they visit their last client and return to the centre they started at, before their 8-hour shift ends. Shona and Dimitrios read about the famous travelling salesman problem (TSP) recently, and they realize that even that problem with a single salesman is very hard to solve, and they are dealing with a situation involving multiple ‘salesmen’ in this scenario, so they are thinking about developing some heuristic approach and they will share their ideas on this with you later in part c. 


Shona and Dimitrios expect you to carry out the following actions for this project:

a) Using the file “Distances.csv”, Shona and Dimitrios consider a scenario with six locations (UserID’s 71, 142, 280, 3451, 6846, and 7649) being ‘centres’ (or ‘depots’), and the remaining 227 locations indicating where the ‘clients’ are located. For their first modelling task of finding the sum of all “shortest distances from each client to the nearest of the centres”, Shona and Dimitrios think that they can build a balanced minimum cost network flow model using the distances provided (and appropriately setting the net supplies for each node), but they are not sure if, e.g., they will need dummy nodes or arcs. (Hint: it’s very likely they do.) They ask for your help to build such a network optimization model in FICO Xpress Mosel (note you are not using any algorithm, but solving just a single Xpress model). Make sure your model reads all input files (in the same fashion as you have seen in the class for other types of input files), your network is balanced, and nearest centres are not identified manually (a correct model would not need that). Comment throughout your file where appropriate. 
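One possible way to balance such a network is sketched below, under assumptions that go beyond the text: every client supplies one unit, a dummy sink node $t$ absorbs all 227 units, and each centre is joined to $t$ by a zero-cost dummy arc. The symbols are introduced here for illustration: $A$ is the set of arcs with a nonzero distance $d_{ij}$, $A'$ is $A$ plus the dummy arcs, and $b_i$ is the net supply of node $i$.

```latex
\begin{aligned}
\min\quad & \sum_{(i,j)\in A} d_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j:(i,j)\in A'} x_{ij} \;-\; \sum_{j:(j,i)\in A'} x_{ji} = b_i, && i \in N \cup \{t\},\\
& x_{ij} \ge 0, && (i,j) \in A',
\end{aligned}
\qquad
b_i =
\begin{cases}
1 & \text{if } i \text{ is a client},\\
0 & \text{if } i \text{ is a centre},\\
-227 & \text{if } i = t.
\end{cases}
```

The net supplies sum to 227 − 227 = 0, so the network is balanced. Each unit travels along a cheapest path from its client to some centre and then leaves through that centre's dummy arc at no cost, so the flow on the dummy arc out of a centre counts the clients assigned to it.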

b) Ensure your FICO Xpress model from a) outputs into an output file of type .txt with 2 columns and 6 rows (plus any header rows, if you prefer), where the first entry in each row should be the UserID of each centre and the second entry in each row should be “the number of clients who have this particular centre as their nearest centre.” Also write a two-page memo to Shona and Dimitrios (within standard margins, 11pt Arial or Calibri, 1.5 line spacing) including a brief explanation of your model and how you balanced it (include a drawing if you prefer), and any key messages regarding model outputs, including whether you obtained integer flows and why. (Hint: The second entry in your output file may sound complex, but if you have taken the correct modelling approach, this will be obvious, and you will not need anything complex such as backtracking individual paths.)

c) Shona and Dimitrios would like to use the distance matrix from the previous part to design and develop a generic routing algorithm, which they plan to use not only with their current client but also for similar problems with other clients in the future. First of all, they assume that the locations of the centres are not given as in part a), but rather the algorithm also chooses the best locations of centres from all 233 locations, given how many centres are to be set up (so that once you choose a location as a ‘centre’, you can assume there is no client there.) As in part a), they assume there will be 6 centres in the network, and the remaining 227 locations will act as clients. For choosing which locations would be the “best” centres, they have two key criteria: the connectivity of a node to the rest of the network (i.e., how many other nodes it has a direct connection with), and the average distance of a node to the rest of the network. 

 

They are not sure how to proceed with this but one crude heuristic idea they have is as follows:

STEP 0. Initialize the set of ‘centres’ as an empty set, the set of ‘clients’ as all nodes in the network, and define a threshold parameter for “short walking distance”. GO to Step 1.
STEP 1. Calculate the connectivity and average distance of each node in the set of ‘clients’ to the rest of the ‘clients’. Identify the most connected node(s), and if there is more than one such node, pick the node i with the smallest average distance. Add node i to set of ‘centres’ and remove it from the set of ‘clients’. If the set of ‘centres’ has already 6 nodes, STOP. Otherwise, GO to Step 2.


STEP 2. For each node j that is in the set of ‘clients’, check if its distance to node i is less than the threshold parameter. If so, remove it from the set of ‘clients’. If the set of ‘clients’ is not empty, GO to STEP 1. Otherwise, GO to Step 3.

STEP 3. Pick as many random nodes from nodes not in the set of ‘centres’ as necessary to have 6 nodes in the set of ‘centres’ and STOP.
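The four steps above can be sketched in Python as follows. This is one reading of the heuristic rather than a definitive implementation: "average distance" is taken over directly connected candidate nodes only, a node with no recorded link to the new centre is never eliminated in Step 2, remaining ties are broken by the smaller node identifier, and the names `pick_centres`, `threshold` and `seed` are hypothetical.

```python
import random


def pick_centres(arcs, nodes, n_centres=6, threshold=300.0, seed=0):
    """Crude centre-selection heuristic following Steps 0-3.

    arcs: {(u, v): seconds} for direct links (missing pair = no link).
    threshold: the 'short walking distance', in seconds.
    """
    rng = random.Random(seed)
    centres, clients = [], set(nodes)             # Step 0
    while clients and len(centres) < n_centres:
        def score(i):                             # Step 1
            links = [arcs[(i, j)] for j in clients
                     if j != i and (i, j) in arcs]
            avg = sum(links) / len(links) if links else float("inf")
            return (-len(links), avg, i)          # most links, then nearest
        best = min(clients, key=score)
        centres.append(best)
        clients.discard(best)
        if len(centres) == n_centres:
            break
        # Step 2: drop candidates within a short walk of the new centre.
        clients = {j for j in clients
                   if not ((best, j) in arcs and arcs[(best, j)] < threshold)}
    # Step 3: top up with random non-centre nodes if candidates ran out.
    spare = sorted(n for n in nodes if n not in centres)
    while len(centres) < n_centres and spare:
        centres.append(spare.pop(rng.randrange(len(spare))))
    return centres


# Invented 5-node network with symmetric walking times in seconds.
toy = {}
for u, v, s in [(1, 2, 100), (1, 3, 200), (1, 4, 400), (2, 3, 150), (4, 5, 250)]:
    toy[(u, v)] = toy[(v, u)] = float(s)
two = pick_centres(toy, [1, 2, 3, 4, 5], n_centres=2, threshold=300.0)
three = pick_centres(toy, [1, 2, 3, 4, 5], n_centres=3, threshold=300.0)
```

In the toy run, node 1 has the most links and is picked first; nodes 2 and 3 are then eliminated as too close, and node 4 wins the tie with node 5. Asking for three centres exhausts the candidates, so the last one comes from Step 3.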

Shona and Dimitrios believe this simple algorithm would be sufficient to identify the “best” centres. They are not sure how to set the threshold parameter, which acts as an eliminator of nodes that are already very close to a centre so that remaining nodes get more attention for picking the remaining centres. You might simply pick a random number (e.g. 5 minutes) for this threshold, or experiment with e.g. 3/4/5 minutes and come up with a “better” choice, or propose a way to calculate it using the given data.

Using the 6 centres identified with the previous algorithm, the distances provided, and 20-minute standard duration for each visit at a client, they want to generate valid routes for each staff member, starting their 8-hour shift at one of the centres. Since they know this problem will be very complex to model and solve as an integer programming problem, they have some ideas for building up a heuristic. One idea they have is as follows:

STEP 0. Label all client nodes as “not served”. GO to Step 1.

STEP 1. If all client nodes are labelled as “served”, then STOP. Otherwise, start at one of the centres with a new staff member, initialize the work duration of this staff member as zero, keep track of each node on this staff member’s route, and GO to Step 2.

STEP 2. From the current node, pick the nearest neighbour client node j that is “not served”. If the current work duration, the walking distance to j and the task duration add up to strictly more than 8 hours, then this staff member has already completed their shift (without visiting j), GO to Step 1. Otherwise, GO to Step 3. 

STEP 3. Add the node j to this staff member’s route, and add the walking distance to j as well as the task duration at j to the current work duration. Update the label of node j from “not served” to “served”. GO to Step 2.

Shona and Dimitrios realize that Step 1 is a bit tricky, since they do not know which centre is best to start with. They think this might be done randomly, but a more sensible way would be picking a centre that is more “promising” – for example, in a similar fashion to the previous algorithm that identified the 6 centres, you can measure average/total distance from each centre to all client nodes that are “not served” yet, and then pick the centre with the smallest such distance. They also think there might be other improvements you may apply to this heuristic algorithm (for example, in Step 2, what if there are client nodes that are “not served” yet but they are not a neighbour of the current node?)
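A minimal Python sketch of the routing steps is given below, with several simplifications that are assumptions rather than part of the stated heuristic: every staff member starts at the first centre in the list, only direct links count as neighbours, the walk back to the centre is not checked, and the search stops if a fresh staff member cannot serve anybody (otherwise Step 1 would repeat forever). All times are in seconds and the visit duration is a parameter rather than a hard-coded value.

```python
def route_staff(arcs, centres, clients, visit, shift=8 * 3600):
    """Nearest-neighbour routing following Steps 0-3 (times in seconds).

    arcs: {(u, v): seconds} for direct links only.
    Returns (routes, unserved): one route per staff member, plus any
    clients that could not be reached under this sketch's assumptions.
    """
    unserved = set(clients)                       # Step 0
    routes = []
    while unserved:                               # Step 1
        current, worked, route = centres[0], 0, [centres[0]]
        while True:                               # Step 2
            options = [(arcs[(current, j)], j) for j in unserved
                       if (current, j) in arcs]
            if not options:
                break                             # no unserved neighbour
            walk, j = min(options)
            if worked + walk + visit > shift:
                break                             # shift is over
            route.append(j)                       # Step 3
            worked += walk + visit
            unserved.discard(j)
            current = j
        if len(route) == 1:
            break                                 # nobody reachable: stop
        routes.append(route)
    return routes, unserved


# Invented network: centre 0 and clients 1-3; a 1-hour shift keeps it small.
toy = {}
for u, v, s in [(0, 1, 600), (1, 2, 300), (0, 2, 900), (2, 3, 600), (0, 3, 1200)]:
    toy[(u, v)] = toy[(v, u)] = s
routes, left = route_staff(toy, [0], [1, 2, 3], visit=20 * 60, shift=3600)
```

In the toy run, the first staff member serves clients 1 and 2 (3300 seconds of work) and cannot fit client 3, so a second staff member serves it. The number of staff required is `len(routes)`; replacing `centres[0]` with a scoring rule is where the ‘best centre’ idea would plug in.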

They would like you to implement these two algorithms (with any improvements you might suggest) using either Mosel, Python, MATLAB or C/C++, and also run tests for comparison (in particular, picking a random centre in Step 1 vs. applying their idea of identifying the ‘best centre’, but you are welcome to test any other ideas you might have.) Your code should be commented wherever necessary (also ensure to take care of different time units!) Your code should output at least the routes each staff member followed, and the total number of staff required to cover all clients. Also write a memo to Shona and Dimitrios (up to three pages within standard margins, 11pt Arial or Calibri, 1.5 line spacing) to discuss any algorithmic details including any improvements you suggest and any key messages regarding model outputs. (Note 1: The first algorithm for identifying best centres carries a weight of 12 marks and the second algorithm for routing staff carries a weight of 18 marks. You can work on either/both algorithms, and if you prefer to skip the first algorithm and work only on the routing algorithm, then simply pick 6 random centres to start with. Note 2: You will see an algorithm in week 9 that would be useful for the routing algorithm.) 


d) Shona and Dimitrios are aware that there may be some future strategic plans to incorporate into the decision-making process, which their client would like to explore and integrate if possible. Using the original network from part a) with 6 given centres (no capacities on any of these centres) and ignoring staff routing (of part c), they would like to focus on travelling times in the network only. Essentially, they consider a minimum cost network flow problem setting to start with, where each of the 227 clients has exactly one unit of demand, and the 6 centres provide any necessary supply. There are two strategic investment options considered in order to reduce travel times of staff (note that you do not consider task durations in this case, as your focus is only travel times.)

In order to address the investment possibilities, they are thinking that one can build a mixed integer programming (MIP) model. The first investment possibility is a corporate membership to the city’s bike network that would allow their staff to use bikes instead of walking from client to client. This investment would have a fixed charge of £55 per day (all demands in the network are daily) and they are aware this is a binary decision. If bikes are used, they estimate that this would not affect travel time on any arc with a distance less than 10 minutes (or 600 seconds), but will reduce all travel times on arcs longer than 10 minutes by 40%, and all travel times on arcs longer than 20 minutes by 60%. The second investment opportunity is a corporate membership to the city’s electric scooter network, which has a daily fixed cost of £125, and will cut all travel times on arcs longer than 5 minutes by 75%. Note that they will not consider having both corporate memberships simultaneously. Also note that staff are paid £40 per hour (if you save, say, 9 hours of travel time from the total distance covered in the network in a day, then that means you have a saving of staff time worth £360.)
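The tiered reductions can be turned into per-arc saving coefficients before any model is built. The sketch below does this arithmetic in Python with hypothetical function names. How arcs of exactly 600 or 1200 seconds are treated is an assumption (they fall into the lower tier here), since the text only says "longer than".

```python
RATE_PER_SECOND = 40.0 / 3600.0   # staff cost: 40 pounds per hour


def bike_saving_fraction(seconds):
    """Fraction of travel time saved by bikes on one arc.

    Assumption: arcs of exactly 600 s or 1200 s stay in the lower tier.
    """
    if seconds > 1200:
        return 0.60
    if seconds > 600:
        return 0.40
    return 0.0


def scooter_saving_fraction(seconds):
    """Fraction of travel time saved by scooters on one arc."""
    return 0.75 if seconds > 300 else 0.0


def saving_value(seconds, fraction, flow=1.0):
    """Pounds saved per day on one arc carrying `flow` units."""
    return fraction * seconds * RATE_PER_SECOND * flow


# The 2110-second arc between UserIDs 71 and 142, with one unit of flow.
bike = saving_value(2110, bike_saving_fraction(2110))
scooter = saving_value(2110, scooter_saving_fraction(2110))
```

For that arc, one unit of flow saves about £14.07 with bikes and about £17.58 with scooters, before the daily fixed charges of £55 and £125 are taken into account.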

Shona and Dimitrios think that modelling this problem may require further decision variables, in addition to variables representing binary decisions and flows on arcs. One observation they make (though they are not fully clear) is as follows. Let’s consider the arc between UserID’s 71 and 142, with a walking distance of 2110 seconds. If it is decided to invest in the bike membership, this would help to save the following monetary quantity on this specific arc: (0.60 × 2110 seconds) × (£40 / 3600 seconds) × X_{71,142}

Note the first term is simply the saving of 60% of the distance (since it is longer than 20 minutes), the second term is the hourly rate per staff member (or, equivalently, per 3600 seconds), and finally X_{71,142} indicates the continuous decision variable representing the flow on this arc.

Shona and Dimitrios are aware that the saving quantity could be defined as a decision variable but then it cannot be multiplied with a binary variable, as that would result in a nonlinear model (note that if the decision was not to invest in bikes, then the saving would simply be zero.) They expect you to help them to build this MIP model in FICO Xpress Mosel, and write a two-page memo (within standard margins, 11pt Arial or Calibri, 1.5 line spacing), explaining if your model works and how you calculated any big-M numbers necessary, and what you recommend to them with regard to model output. (Note: They are aware that the original model with all 233 locations can be very big, hence they are happy with you limiting the Xpress solution time to 600 seconds if necessary.)
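A minimal sketch of one standard way to avoid the product of a binary and a continuous variable is given below. All symbols other than the flow are introduced here as assumptions: $x_{ij}$ is the flow on arc (i, j), $y^{b}$ and $y^{s}$ are the binary bike and scooter decisions, $s^{b}_{ij}$ and $s^{s}_{ij}$ are the savings in pounds on that arc, $c^{b}_{ij}$ and $c^{s}_{ij}$ are the per-unit savings (for example $c^{b}_{71,142} = 0.60 \times 2110 \times 40/3600$), and $M_{ij}$ is an upper bound on $x_{ij}$.

```latex
\begin{aligned}
\min\quad & \sum_{(i,j)} \tfrac{40}{3600}\, d_{ij}\, x_{ij}
  \;-\; \sum_{(i,j)} \left( s^{b}_{ij} + s^{s}_{ij} \right)
  \;+\; 55\, y^{b} + 125\, y^{s} \\
\text{s.t.}\quad
& s^{b}_{ij} \le c^{b}_{ij}\, x_{ij}, \qquad s^{b}_{ij} \le c^{b}_{ij}\, M_{ij}\, y^{b}, \\
& s^{s}_{ij} \le c^{s}_{ij}\, x_{ij}, \qquad s^{s}_{ij} \le c^{s}_{ij}\, M_{ij}\, y^{s}, \\
& y^{b} + y^{s} \le 1, \qquad y^{b},\, y^{s} \in \{0,1\}, \qquad s^{b}_{ij},\, s^{s}_{ij} \ge 0, \\
& \text{flow balance constraints on } x \text{ as in the minimum cost network flow model.}
\end{aligned}
```

When a membership is not bought, its second constraint forces the saving to zero; when it is bought, minimisation pushes the saving up to $c_{ij} x_{ij}$ provided $x_{ij} \le M_{ij}$. Since only 227 units of demand exist in total, $M_{ij} = 227$ is one choice that should be large enough, although tighter values may solve faster.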

 
