
Data Analysis in Genomics and Transcriptomics Cheat Sheet (DRAFT)

Unit 2: Genome analysis; Unit 4: Metagenome and metatranscriptome analysis; Unit 5: Comparative genomics

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Self-Organizing Map (SOM)

- create a 1D/2D lattice of artificial neurons
- define the number of neurons in each dimension
- assign each neuron a random weight vector with the same dimensionality as the input
- select random data point from input
- choose winning neuron based on similarity with input
- update weights of winning and neighboring neurons (based on learning rate & neighborhood function)
- repeat for all data points until convergence
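
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names (train_som, best_matching_unit), the Gaussian neighborhood function, the linear decay of the learning rate and radius, and the fixed number of epochs (used here in place of a real convergence test) are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small 2D SOM; returns weights[row][col] as lists of floats."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial weights with the same dimensionality as the input.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit points in random order
            # Assumed schedule: learning rate and radius decay linearly.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.1)
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood on lattice distance to the winner.
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, calling best_matching_unit(weights, x) on each data point gives its position on the lattice, so similar inputs should land on the same or neighboring neurons.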

Principal Component Analysis (PCA)

- obtain distance matrix
- construct metric matrix
- compute eigenvalues and eigenvectors of metric matrix
- compute Cartesian coordinates
PCA focuses on maximizing variance, which compromises the resolution of proximal (nearby) clusters.
A distance matrix alone doesn't reveal the underlying dimensionality of the space in which these points exist.
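
A rough sketch of the four steps listed for this method, in plain Python. Starting from a distance matrix in this way corresponds to what is usually called classical multidimensional scaling (principal coordinates analysis). The function names are hypothetical, the distance matrix is assumed to be symmetric and Euclidean, and power iteration with deflation is used only to keep the example dependency-free; in practice a linear algebra library would normally do the eigendecomposition.

```python
import math


def metric_matrix(dist):
    """Double-center the squared distances to get the metric (Gram) matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]  # equals column means if symmetric
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(mat, k, iters=500):
    """Power iteration with deflation; returns [(eigenvalue, eigenvector)]."""
    n = len(mat)
    mat = [row[:] for row in mat]  # work on a copy
    pairs = []
    for idx in range(k):
        v = [1.0 / (i + 1 + idx) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                mat[i][j] -= lam * v[i] * v[j]
    return pairs


def coordinates_from_distances(dist, k=2):
    """Cartesian coordinates: eigenvector entries scaled by sqrt(eigenvalue)."""
    pairs = top_eigenpairs(metric_matrix(dist), k)
    return [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
            for i in range(len(dist))]
```

The number of clearly non-zero eigenvalues of the metric matrix indicates how many dimensions are needed to reproduce the distances, which is the information the raw distance matrix does not show directly.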

t-distributed Stochastic Neighbor Embedding (t-SNE)

- Similarity scores are computed for every pair of points.
- The points are then placed randomly in a 2- or 3-dimensional space.
- Using an optimization method, the points are moved step by step, guided by the similarity scores, until convergence is achieved.
Retains resolution for close clusters, while scaling the distances between farther clusters to fit in frame.
Often used in combination with PCA (PCA is typically run first to reduce dimensionality).
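
The three steps can be illustrated with a heavily simplified sketch in plain Python. It is not the full algorithm: a single fixed Gaussian bandwidth (sigma) replaces the usual per-point perplexity search, plain gradient descent replaces the momentum and early-exaggeration tricks, and a fixed step count stands in for a convergence check. The function names and default values are assumptions suited only to very small data sets.

```python
import math
import random


def pairwise_sq_dists(points):
    """Squared Euclidean distance between every pair of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def input_similarities(points, sigma=1.0):
    """Symmetric Gaussian similarity scores p_ij that sum to 1."""
    n = len(points)
    d2 = pairwise_sq_dists(points)
    cond = []
    for i in range(n):
        row = [math.exp(-d2[i][j] / (2 * sigma ** 2)) if j != i else 0.0
               for j in range(n)]
        total = sum(row) or 1.0  # guard against an all-zero row
        cond.append([v / total for v in row])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def tsne_sketch(points, dims=2, sigma=1.0, steps=500, lr=1.0, seed=0):
    """Return low-dimensional coordinates for the input points."""
    rng = random.Random(seed)
    n = len(points)
    p = input_similarities(points, sigma)
    # Random initial placement in the low-dimensional space.
    y = [[rng.gauss(0.0, 1e-2) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        d2 = pairwise_sq_dists(y)
        # Student-t kernel (one degree of freedom) in the embedding.
        inv = [[1.0 / (1.0 + d2[i][j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
        z = sum(map(sum, inv))
        grads = []
        for i in range(n):
            g = [0.0] * dims
            for j in range(n):
                # Gradient of KL(P || Q) with q_ij = inv_ij / z.
                coeff = 4.0 * (p[i][j] - inv[i][j] / z) * inv[i][j]
                for k in range(dims):
                    g[k] += coeff * (y[i][k] - y[j][k])
            grads.append(g)
        for i in range(n):
            for k in range(dims):
                y[i][k] -= lr * grads[i][k]
    return y
```

Pairs with a high input similarity attract each other in the embedding and pairs with a low one repel, which is what moves the points step by step.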

K-Means Clustering

- choose the number of clusters (k)
- randomly initialise k cluster centroids
- calculate distance between each data point and each centroid (Euclidean or Manhattan distance)
- assign each data point to the cluster whose centroid is closest
- recalculate each centroid by taking the mean of all data points assigned to the cluster
- iterate until a stopping criterion is met:
1. max number of iterations reached
2. centroids no longer change significantly
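
A minimal sketch of the procedure above in plain Python. The function name, the default values, and the choice to initialise centroids by picking k random data points are assumptions for illustration; Euclidean distance is used, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (centroids, labels) for a list of equal-length points."""
    rng = random.Random(seed)
    # Random initialisation: use k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):  # stopping criterion 1: iteration cap
        # Assign each point to the cluster with the closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recalculate each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: unchanged
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stopping criterion 2: centroids barely moved
            break
    return centroids, labels
```

Because the result depends on the random starting centroids, it is common to run the procedure several times with different seeds and keep the best clustering.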