Activity 1: Clustering cancerous tissue samples
In this first activity, we consider the problem of clustering tissue samples based on their gene-expression levels. This is a relevant problem in bioinformatics, as it can aid the discovery of different subtypes of cancer. In particular, we will use a dataset from the study in Golub et al. (1999), which contains 72 samples from human patients with leukemia. The expression levels of 1868 selected genes have been measured in all of these samples. The dataset thus contains 72 observations (rows) and 1868 variables (columns).
Important notes:
An ARFF file is essentially a CSV file with some metadata at the beginning. The metadata could be removed manually in a text editor and the data then saved as a CSV file, but there is no need for that: this type of file can be read straightforwardly into an R data frame using, for example, the function read.arff() from the package foreign. Use this option in your assignment. Once you read the file, you will notice that the resulting data frame actually contains 1869 (rather than 1868) columns. There is an additional, rightmost column (column 1869, named Classe) with class labels. These labels (‘1’ or ‘2’) indicate the subtype of leukemia associated with each sample. This information is available from specific domain knowledge: it is already known that there are 47 tissue samples of subtype ALL (class ‘1’) and 25 samples of subtype AML (class ‘2’). These class labels will not be used for clustering, but only for external assessment of the results. The goal is to assess to what extent the two subtypes of leukemia can be revealed as clusters in a completely unsupervised way.
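As a sketch of how this reading step might look in R (the filename leukemia.arff and the object names are assumptions chosen here for illustration; your actual filename may differ):

```r
# Read the ARFF file into a data frame using foreign::read.arff()
# (install.packages("foreign") first if the package is not available)
library(foreign)

leukemia <- read.arff("leukemia.arff")  # filename is an assumption

dim(leukemia)  # expect 72 rows and 1869 columns (1868 genes + the Classe column)

# Set aside the class labels (rightmost column, named 'Classe')
# and keep the 72 x 1868 predictor data frame separately
labels <- leukemia$Classe
X <- leukemia[, -ncol(leukemia)]
```

The labels stored in labels are used only in the later, external-assessment steps; all clustering is performed on X alone.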
You are asked to:
1. Read the dataset directly from the ARFF file into a data frame.
2. Set aside the rightmost column (containing the class labels) from the data, storing it separately from the remaining data frame (with the 1868 predictors).
3. Use the 72 × 1868 data frame to compute a matrix containing all the pairwise Euclidean distances between observations, that is, a 72 × 72 matrix with distances between tissue samples according to their 1868 expression levels. This matrix must be of type dist, which can be achieved either by using the function dist() from the base R package stats or by coercion using the function as.dist().
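A minimal sketch of this step, assuming the 72 × 1868 predictors are stored in a data frame X (a name chosen here for illustration):

```r
# Pairwise Euclidean distances between the 72 tissue samples.
# stats::dist() returns an object of class 'dist' directly, so no
# coercion with as.dist() is needed in this case.
d <- dist(X, method = "euclidean")
```

as.dist() is only required when the dissimilarities come from elsewhere as a plain square matrix, as in Activity 2 below.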
4. Use the distance matrix as input to call the Single-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
5. Use the distance matrix as input to call the Complete-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
6. Use the distance matrix as input to call the Average-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
7. Use the distance matrix as input to call Ward’s clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
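Items 4 to 7 differ only in the method argument passed to hclust(). A sketch, assuming the distance matrix is stored in d:

```r
# Hierarchical clustering with four linkage criteria (stats::hclust)
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_ward     <- hclust(d, method = "ward.D2")  # Ward's method; note that
                                              # "ward.D" is an older variant

# Plot the four dendrograms in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(hc_single,   main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
plot(hc_average,  main = "Average linkage")
plot(hc_ward,     main = "Ward's method")
```

No class labels are involved at this stage; the default leaf labels are just the row indices of the observations.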
8. Compare the dendrograms plotted in Items 4 to 7. Visually, the dendrograms suggest that some clustering algorithm(s) produce clearer clusters than the others. In your opinion, which algorithm(s) might we be referring to, and why? In particular, in which respects do the results produced by this/these algorithm(s) look clearer? Perform Item 9 below only for this/those algorithm(s).
9. Redraw the dendrogram(s) for the algorithm(s) selected in Item 8, now using the class labels that you stored separately in Item 2 to label the observations (as arranged along the horizontal axis of the dendrogram). Do some prominent clusters in the dendrogram(s) correspond approximately to the classes (that is, the two subtypes of leukemia)?
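plot.hclust() accepts a labels argument for exactly this purpose. A sketch, assuming the chosen model is stored in hc and the class labels in labels (both names are assumptions from the earlier sketches):

```r
# Redraw the dendrogram with the class labels ('1'/'2') shown at the
# leaves along the horizontal axis, instead of the default row indices
plot(hc, labels = as.character(labels),
     main = "Selected linkage, leaves labelled by class")
```

If the prominent clusters correspond to the classes, runs of the same label should appear grouped under the main branches.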
10. Repeat the analysis, now using normalised data. The 1868 predictors were not normalised before the distance matrix was computed in Item 3. Normalisation is a non-trivial aspect of unsupervised clustering, as there is no ground truth against which to assess whether or not it improves performance. On the one hand, it may prevent variables with wider value ranges from dominating the distance computations; on the other hand, it may distort clusters by removing natural differences in variance that help characterise them as clusters. Normalisation thus becomes an aspect of Exploratory Data Analysis when it comes to clustering: the analyst will usually generate and try to interpret results with both normalised and non-normalised versions of the data. The type of normalisation depends on the application at hand. Here, we are computing Euclidean distances between rows of the dataset, so the type of normalisation that applies is typically the so-called z-score normalisation of columns, where each column is rescaled to have zero mean and a standard deviation of 1. In this item, you are first asked to normalise the data this way before computing the distance matrix as in Item 3. Then, repeat Items 4 to 9. Does normalisation improve or worsen the results on this dataset?
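Column-wise z-score normalisation is provided by the base R function scale(), which by default centres each column to mean zero and rescales it to unit standard deviation. A sketch, again assuming the predictors are in a data frame X (an illustrative name):

```r
# z-score normalisation of the 1868 columns (zero mean, unit standard deviation)
Xz <- scale(X, center = TRUE, scale = TRUE)

# Recompute the Euclidean distance matrix on the normalised data,
# then repeat the clustering and assessment of Items 4 to 9 on dz
dz <- dist(Xz, method = "euclidean")
```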
Note: Since the data is high-dimensional (1868 dimensions) and very sparse (only 72 observations), density-based algorithms are not expected to recover any clusters, because the notion of density vanishes in sparse data embedded in very high-dimensional spaces, due to a well-known phenomenon called the curse of dimensionality. For this reason, we are not trying density-based methods in this first activity.

Activity 2: Clustering genes (Part A)
In this second activity, we consider the problem of clustering genes according to their gene-expression levels across different conditions in a controlled experiment. The goal is to identify genes that show similar expression patterns over a wide range of experimental conditions. This is a relevant problem in bioinformatics as it can, for example, help identify genes that share the same regulatory mechanisms or functions in an organism. In particular, we will use a dataset YeastGalactose from the study in Yeung, Medvedovic, and Bumgarner (2003), which is composed of the gene expression levels of a subset of 205 selected genes of the yeast Saccharomyces cerevisiae from 20 different measurements (experimental conditions). The dataset thus contains 205 observations (rows) and 20 variables (columns). It is available in the file yeast.arff.
Important note: Once you read the file, you will notice that the resulting data frame actually contains 21 (rather than 20) columns. There is an additional, rightmost column (column 21, named Classe) with class labels. These labels (‘cluster1’, ‘cluster2’, ‘cluster3’ and ‘cluster4’) indicate genes whose expression patterns reflect four functional categories. Thus, there are four known categories of co-regulated genes in the data. This information is available from specific domain knowledge. These labels will not be used for clustering, but only for external assessment of the results. The goal is to assess to what extent the four categories of genes can be revealed as clusters in a completely unsupervised way.
You are asked to:
11. Read the dataset directly from the ARFF file into a data frame.
12. Set aside the rightmost column (containing the class labels) from the data, storing it separately from the remaining data frame (with the 20 predictors).
13. Use the 205 × 20 data frame to compute a matrix containing all the pairwise Pearson-based dissimilarities between observations, that is, a 205 × 205 matrix with dissimilarities between genes according to their 20 expression measurements. Important note: It is well known that co-regulated genes are better characterised by similar trends in their gene-expression profiles, rather than by similar expression levels in terms of absolute values. In other words, the similarity between genes in terms of their expression profiles across different measurements is better captured by a correlation measure, such as Pearson correlation (James, Witten, Hastie, & Tibshirani, 2013), which is the most widely adopted similarity measure for practical applications of gene clustering. For this reason, in this activity we will use Pearson correlation instead of Euclidean distance. However, recall that Pearson correlation is a similarity measure that ranges from −1 (lowest similarity) to +1 (highest similarity). After computing the 205 × 205 Pearson similarity matrix, you have to convert it to a dissimilarity matrix whose values range from 0 (lowest dissimilarity) to +1 (highest dissimilarity). Once you have this Pearson-based dissimilarity matrix, you can coerce it into type dist, as required by the hierarchical clustering methods in the base R package stats.
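A sketch of this step, assuming the 205 × 20 predictors are stored in a data frame Y (an illustrative name). Note that cor() computes correlations between columns, so the data must be transposed to obtain gene-to-gene correlations; the rescaling (1 − r) / 2 then maps a Pearson similarity in [−1, +1] onto a dissimilarity in [0, 1]:

```r
# Pearson correlations between rows (genes): transpose so that each
# gene becomes a column, giving a 205 x 205 similarity matrix in [-1, +1]
r <- cor(t(Y), method = "pearson")

# Convert similarity to dissimilarity in [0, 1]:
# identical profiles -> 0, perfectly anti-correlated profiles -> 1.
# as.dist() coerces the square matrix to class 'dist' for hclust().
dP <- as.dist((1 - r) / 2)
```

Other monotone conversions (e.g. 1 − r, with range [0, 2]) exist; the (1 − r) / 2 form is one common way to obtain the [0, 1] range requested here.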
14. Repeat the clustering analysis in Items 4 to 9 of Activity 1, now using the dissimilarity matrix for the YeastGalactose data computed in Item 13 (and, where applicable, the class labels that you stored separately in Item 12 to label the observations as arranged along the horizontal axis of the relevant dendrograms).