A biological sample contains thousands of cells, each of which is unique and can be examined individually. The cells can be grouped into clusters based on their gene activity. But which genes are distinctive for a given cluster, that is, what are its "marker genes"? A new statistical method called Association Plots helps to identify and analyze these marker genes.
Which genes are specific to a cell type and thus "mark" its identity? As datasets grow, this question is often difficult to answer. Known marker genes are often simply genes that have been observed in specific cell populations. Many more genes may be specific to a cell type but have yet to be discovered.
"Association Plots (APL)," a new statistical method for visualizing gene activity within a cell cluster, facilitates the identification of its marker genes. The plots compare the activity of genes in a given cluster to the activity of genes in all other clusters in the dataset. They also make it simple to see which genes are shared with other clusters.
"Association Plots allow us to discover new marker genes. It also works the other way around: we can match clusters of unknown identity in a dataset to cell types using a provided list of marker genes," says Elzbieta Gralinska of the Max Planck Institute for Molecular Genetics in Berlin.
The biotechnologist is part of Martin Vingron's team, which developed the technique, tested it on two publicly available datasets, and published the results. APL has also been made available as a free package for the statistical environment R. The package allows researchers to inspect their single-cell data visually and to select individual genes with the cursor to see more detail.
Single-cell analysis and grouping
What is the point of identifying marker genes in the first place? Individual RNA molecules in individual cells can be deciphered using modern sequencing technologies. Each cell in a blood sample, for example, can be separated and a sample of the cell's RNAs decoded. These data from single cells represent active genes that were transcribed into RNA molecules.
The benefit is that instead of wondering which cell type a specific RNA belongs to, it can be traced back to its cell of origin. The disadvantage is that sequencing thousands of RNAs in each of tens of thousands of cells generates massive amounts of data.
One solution is to sort the cells according to their RNA content. "Single-cell data are made up of a diverse range of cell types. We're looking for cells of the same type, which should all behave similarly," Martin Vingron explains. It therefore makes sense, he says, to group similar cells computationally. "Marker genes define a cell type for us."
Interactively explore cell clusters
The team demonstrated how the new algorithm works using publicly available data from white blood cells. The various types of white blood cells, such as T-cells, B-cells, and monocytes, are organized into distinct clusters. The researchers confirmed known marker genes and demonstrated that close relatives among blood cells have a high degree of gene activity similarity.
Unlike a plain list of marker genes, the new method lets users visualize these genes, click on each one, and examine its activity in greater detail, Gralinska says. "We're not just providing lists of marker genes; we're also allowing users to examine how these genes function," the researcher explains. "They can dive into their data with Association Plots to learn more about each cell type." She adds that the biological role of the most interesting genes can easily be explored in a subsequent step using Gene Ontology term enrichment analysis, which is compatible with the APL software: a "very useful feature."
The mathematical model that underpins everything
High-dimensional data containing information on gene activity cannot be represented visually without information loss. The same holds true for clustered data, which further complicates the analysis. "Our trick is that we take into account many more dimensions than just two or three," Gralinska explains.
The Association Plots are derived from a mathematical technique that embeds both genes and cells in a common, high-dimensional space at the same time. Measuring the distances between genes and a given cell cluster in this space yields pairs of values that reflect a gene's association with a given cluster while also providing insights into its association with other clusters.
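The idea of turning distances in the shared space into a pair of values per gene can be illustrated with a small sketch. It is not the APL package's code. It assumes that genes and cells have already been embedded in a common space by some earlier step, and it assumes that the pair consists of (x) how far a gene lies along the direction from the origin to the cluster's centroid and (y) how far it lies from that line. The function names, gene names, and toy coordinates are all hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of equal-length coordinate vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def association_coords(gene, cluster_cells):
    """Return an (x, y) pair for one gene relative to one cell cluster.

    Assumption of this sketch:
      x = length of the gene vector's projection onto the line from
          the origin through the cluster centroid
      y = distance of the gene from that line
    """
    c = centroid(cluster_cells)
    c_norm = math.sqrt(sum(v * v for v in c))
    if c_norm == 0:
        raise ValueError("cluster centroid lies at the origin")
    x = sum(g * v for g, v in zip(gene, c)) / c_norm
    gene_sq = sum(g * g for g in gene)
    # Pythagoras; clamp at zero to guard against rounding error.
    y = math.sqrt(max(gene_sq - x * x, 0.0))
    return x, y

# Hypothetical 3-dimensional embedding: two cells of one cluster, three genes.
cluster_cells = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
genes = {
    "gene_a": [3.0, 0.0, 0.0],  # points exactly toward the cluster
    "gene_b": [3.0, 4.0, 0.0],  # toward the cluster, but also elsewhere
    "gene_c": [0.0, 0.0, 5.0],  # unrelated direction
}

coords = {name: association_coords(vec, cluster_cells)
          for name, vec in genes.items()}
for name, (x, y) in coords.items():
    print(f"{name}: x={x:.1f}, y={y:.1f}")
```

Under these assumptions, a gene with a large x and a small y (gene_a) is associated mainly with the chosen cluster. A gene with a similar x but a larger y (gene_b) is pulled toward other directions as well, which corresponds to the article's point that the second value gives insight into associations with other clusters.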
"One limitation of APL is that we rely on pre-clustered data, which means we have to rely on other clustering techniques," says Martin Vingron. "Nonetheless, we hope that our new method will attract a large number of new users. We have discovered that a visual and interactive process simply produces a better analysis."
Reference: DOI: 10.1016/j.jmb.2022.167525