Gene Set Enrichment Analysis (GSEA) - simply explained!

What is GSEA and why is it one of the most popular pathway enrichment analysis methods? In this video, I will give you an overview of Gene Set Enrichment Analysis and how to use it to summarise your differential gene expression results.
We will go through the main concepts of GSEA to get a feeling of how it works and the differences with Over-Representation Analysis methods (ORA).
Hope you like it!
--------------------------------------------------------------------------------------------------------------------
Watched it already?
If you liked this video or found it useful, please let me know! Your comments and feedback are very much appreciated😊
If you have questions, don't hesitate to leave me a comment down below, I will answer as soon as I can:)
--------------------------------------------------------------------------------------------------------------------
Are you into biostatistics and computational analysis?
For more biostatistics tools and resources, you can visit:
biostatsquid.com/
Follow me on Instagram at @biostatsquid:
/ biostatsquid
For more
• simple and clear explanations of biostatistics methods
• computational biology tools
• easy step-by-step tutorials in R and Python
to analyse and visualise your biological data!
Don’t forget to subscribe if you don’t want to miss another video from me!
--------------------------------------------------------------------------------------------------------------------
Other interesting resources for GSEA:
Main GSEA webpage: www.gsea-msigdb.org/gsea/inde...
More on the method itself: www.pnas.org/doi/10.1073/pnas...
Paper comparing pathway enrichment analysis methods: genomebiology.biomedcentral.c...

Published: 14 Jul 2024

Comments: 50

@mocabeentrill 1 year ago

Hi Biostatsquid. This is the most straightforward explanation of GSEA I've heard. Thank you for your hard work.

@MeWatchingYouTubeVideos 11 months ago

Thank you so much! Perfect for beginners to quickly grasp it!

@genuinity 1 year ago

Thanks so much for both videos, such clear and concise explanations. Please continue making videos.🙃

@apedike 1 year ago

So glad I discovered this channel! Looking forward to all these videos.

@biostatsquid 1 year ago

Thank you! Glad you enjoyed it:)

@mercedesdebernardi4215 29 days ago

Your videos are helping me so much!!! Keep it up!!

@enraegen561 1 year ago

I thoroughly enjoy the illustrations. Thank you! :D

@jfromtheusa 11 months ago

wow this is such a clear description!

@user-if2di8gh1o 11 months ago

such an awesome video. informative and clear to follow. thank you so much

@jehadyasin04 1 year ago

Truly amazing videos!

@pygmypuffdraws2753 5 months ago

That was super helpful, thank you so much!

@cintiapalu1929 2 months ago

Amazing, I will definitely recommend it to my colleagues - thanks for such nice work

@nanditasatish2297 1 year ago

love this channel

@amrsalaheldinabdallahhammo663 1 year ago

Simply genius :) ... Keep on making videos and entertain us

@simrangambhir782 1 year ago

thank you very much for the explanation😃

@duupu8417 1 year ago

So helpful. Thanks a lot.

@juliangrandvallet5359 1 year ago

AMAZING!!!

@swifttaylor3107 9 months ago

YOU UNDERSTANDED ME THANK YOU

@ZullyPulido 1 month ago

You're the best!! Greetings from Colombia :)

@jgk9111 9 months ago

The best video

@NAVYAB-eb2jp 1 month ago

Thank you for explaining it well.. Can you pls provide information on the inputs needed to perform ssGSEA ...

@danielgladish2502 2 months ago

Great video! Really helpful for getting an understanding of the analysis workflow! A small critique / suggestion for improvement that I think could be made is in terminology being used, specifically referring to genes in the ranked list as being overrepresented. As you said in the video, one is not filtering any genes, so when looking at your gene set in GSEA, you aren't looking at the proportion of the genes being part of your list, but rather where are the genes located in the unfiltered ranked list containing all the genes.

@biostatsquid 2 months ago

Totally agree! Thanks for your comment:)

@sanjaisrao484 1 year ago

Thanks, ma'am! Please upload a GSEA analysis tutorial in R.

@karoljacek8858 1 year ago

Great material! Do you know of any topology-based methods that work on single-cell datasets (or pseudo-bulk single-cell data)?

@goodoo6745 1 year ago

I love the way you explain the whole concept in simple terms. Could you elaborate more on how to rank the gene list from the FC and p-values of the differential expression? I am trying to make the rnk file to be imported into GSEA.

@biostatsquid 1 year ago

Thanks for your comment! That is a great question, I think many people will have the same issue. I am working on a GSEA tutorial which will show you exactly how to do it, but consider this an advance preview of the full script! :P I work with the package fgsea: bioconductor.org/packages/release/bioc/html/fgsea.html You can read the documentation for more detailed instructions and examples, but for example, if you want to use the sign of log2FC multiplied by the -log10(pval) as the ranking to order your gene list, you can do something like: rankings
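
A minimal sketch of the ranking described in this reply, sign(log2FC) * -log10(pval), written in plain Python rather than with fgsea. The gene names, the numbers, and the `rank_metric` helper are made up for illustration, and the two-column, tab-separated layout is an assumption about what a .rnk file should look like; check the GSEA documentation before importing.

```python
import math

# Hypothetical differential expression results: gene -> (log2FC, p-value).
# These values are invented for illustration only.
de_results = {
    "GENE_A": (2.5, 1e-8),
    "GENE_B": (-1.8, 1e-5),
    "GENE_C": (3.0, 0.8),   # large fold change, but not significant
    "GENE_D": (-0.4, 0.04),
}

def rank_metric(log2fc, pval, min_pval=1e-300):
    """Return sign(log2FC) * -log10(p-value)."""
    sign = (log2fc > 0) - (log2fc < 0)  # +1, 0 or -1
    # Clamp p-values of exactly 0 so that log10 stays finite (an assumption).
    return sign * -math.log10(max(pval, min_pval))

rankings = {gene: rank_metric(fc, p) for gene, (fc, p) in de_results.items()}

# Sort from most significant upregulated to most significant downregulated.
ranked_genes = sorted(rankings, key=rankings.get, reverse=True)

# Two-column, tab-separated text (gene, score), one gene per line.
rnk_text = "\n".join(f"{gene}\t{rankings[gene]:.4f}" for gene in ranked_genes)
print(rnk_text)
```

Note how GENE_C, despite having the largest fold change, ends up near the middle of the list because its p-value is high. Writing `rnk_text` to a file is left out here to keep the sketch free of side effects.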

@Bee-zp5vo 10 months ago

Hi ma'am, could you make a video on the new-generation "topology-based methods" for pathway enrichment analysis, which you mentioned in this video @7:26?

@arfaarashid 1 year ago

Hi Biostatsquid, thanks for the video! I had a question about the number of genes that these analyses are performed on. In a workshop I did on functional analysis, my input contained around 20,000 genes. Is this normal for GSEA? Or should the input size be around 20 or 100? Thanks again

@biostatsquid 1 year ago

Hi Arfaa! Great question. 20,000 genes sounds more than fine for GSEA. Actually GSEA makes more sense with many input genes, more than just 20 (in that case it wouldn't take that long to research what each gene does)

@mihacerne7313 1 year ago

Squidtastic!

@anmolpardeshi3138 1 year ago

Hi. Thanks for the amazing depiction! I was wondering if you could clear up the "permutation" step used in GSEA or FCS analysis. Thanks.

@biostatsquid 1 year ago

Hi Anmol, thank you so much for your comment. As for your question about the gene permutation step, I think the best explanation is the one given by Anthony Castanza in this discussion: "The gene_set permutation mode, which we acknowledge is inferior to the phenotype permutation mode, tests gene sets on the basis of how likely it is that a random gene set of a given size was to be enriched within the given dataset. The results from this distribution of random enrichment scores calculated as a result of sampling random gene sets that would be the same size as the set of interest, are then compared to the true enrichment score of the identically sized real set to determine if the observed enrichment is more extreme than would be expected if the true set, like the random sets, had no functional connection to a given process. In this permutation mode, GSEA constructs a "null" distribution of sets that are random and therefore are assumed to have no coordinated biological function, therefore the null hypothesis would be that the given real set has no coordinated biological function within the data, an enrichment more extreme than that observed in the null distribution (sets that we "know" are random and have no coordinated biological function) would allow us to reject the null hypothesis and say that the set does have a coordinated function at [pValue] level of probability." groups.google.com/g/gsea-help/c/dveYVGQGMS0/m/l5l2sli6CwAJ? Hope this helped!
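
To make the gene_set permutation idea concrete, here is a rough sketch in plain Python. It is not the GSEA implementation: the enrichment score below is an unweighted running sum (real GSEA weights the steps by the ranking metric), the p-value is two-sided on the absolute score, and the function names, the toy gene list, and `n_perm` are all assumptions chosen for illustration.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at set genes, step down otherwise.
    Returns the running-sum value with the largest absolute deviation from 0."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def gene_set_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    rng = random.Random(seed)
    members = [g for g in ranked_genes if g in gene_set]
    observed = enrichment_score(ranked_genes, set(members))
    # Null distribution: scores of random gene sets of the same size.
    null = [
        enrichment_score(ranked_genes, set(rng.sample(ranked_genes, len(members))))
        for _ in range(n_perm)
    ]
    # Fraction of random sets at least as extreme as the real one
    # (the +1 keeps the estimate away from exactly 0).
    extreme = sum(abs(score) >= abs(observed) for score in null)
    return observed, (extreme + 1) / (n_perm + 1)

ranked = [f"gene{i}" for i in range(200)]   # toy ranked list
pathway = set(ranked[:15])                  # toy pathway sitting at the very top
es, pval = gene_set_permutation_pvalue(ranked, pathway)
print(f"ES = {es:.3f}, permutation p-value = {pval:.4f}")
```

Because the toy pathway is packed at the top of the list, random gene sets of the same size essentially never reach its score, so the estimated p-value is small. The sketch assumes the gene set has at least one gene in the ranked list and does not cover all of it.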

@davidguardamino 1 year ago

Hi! Great video. I have seen that it is very popular to use the fold change to rank the genes... so here, when using FC * -log(p-value), is it a convention? (Sorry if my question is very odd, I am new to this)

@biostatsquid 1 year ago

Hi David. Not at all, that is a great question! It depends on what you have. If you rank all genes, you also include genes with a very high p-value (for example, gene X with p-val = 0.8). Perhaps your gene X has an amazing fold change, meaning there is a big difference between the two groups you are comparing, but with a p-val of 0.8 that big change is just not significant. Using sign(log2FC) * -log10(pval) is a way of taking this into account. -log10(pval) transforms the p-values (which go from 0 to 1) onto a more manageable scale (basically, instead of pval 0.00000000000000001 you have a -log10(pval) of 17). The sign(log2FC) just makes that manageable number positive (if upregulated, so log2FC > 0) or negative (if downregulated, so log2FC < 0). This way, your genes will be ranked from downregulated, SIGNIFICANT genes -> downregulated, less significant genes -> non-significant genes -> upregulated, less significant genes -> upregulated, SIGNIFICANT genes. Of course, you can also pre-filter your genes to only include significant ones (e.g., using pval < 0.01 or 0.05), and then just sort them by FC without worrying about the significance. Does this make sense? Hopefully this helped. Thanks for the question!

@rishikeshlotke 8 months ago

@biostatsquid Hi, I tried to work with the formula you present at 3:49 for the gene ALDOB from your table. From my calculation based on your formula, the rank for ALDOB comes to -27.1066. In your orange ranked table at 3:49, I see the ranking is done by just using -log10(pval), but in the next slide at 3:51 ALDOB has a positive ranked value of 11.3. Could you explain what I am doing wrong or missing here? Also, does it make any sense to use adjusted p-values (FDR) instead of regular p-values for such a ranking calculation? Why or why not? Thanks in advance for your clarification.

@svitlanatretyak4438 5 months ago

Thanks for the info! Really helpful 🙌🏻 In my experiment multiple conditions were tested and I used multiple comparison tests. Thus, I have no FC value. Can I simply use the F-statistic (or pval/padj) for my list of genes to perform GSEA? Did you ever have this problem? Thanks in advance!

@eloisadalsin2300 1 month ago

@Scientific_Updates 1 year ago

Dear Biostatsquid, thanks for the video, your explanation is really nice. I wanted to ask: some online platforms for performing GSEA, e.g. Broad Institute GSEA, require an organism database, and they do not contain a database for bacterial genomes. I have RNA-seq data on which I need to perform GSEA, but I am unable to because no database is available in the required input format. Please suggest an alternative. Thanks in advance

@biostatsquid 1 year ago

Hi! Thanks for your feedback. I have not really worked with prokaryotes, but FUNAGE-Pro could be a possible solution - 'comprehensive web server for gene set enrichment analysis of prokaryotes' pubmed.ncbi.nlm.nih.gov/35641095/ funagepro.molgenrug.nl/ Hope it works!

@Scientific_Updates 1 year ago

@biostatsquid Thanks for your response. I have performed the analysis through FUNAGE-Pro, but its functional enrichment analysis didn't work in my case. I am trying clusterProfiler and goseq, but they all need an org database, which I don't have.

@pabloaguirreazorin8324 2 months ago

Hi Biostatsquid. What do you use to get the ranked list: p-value or adjusted p-value? If it is p-value, why?

@biostatsquid 2 months ago

Hi! Thanks for your comment:) I normally use -log10(p-adj) * sign(log2FC), maybe this will help: www.biostars.org/p/375584/ www.biostars.org/p/298312/

@funnyarian 3 months ago

Squidtastic!! How accurate is it to say that in the ranked list we have the most upregulated genes at the top and the most downregulated at the bottom (as you said in the video and image)? I would change it to: at the top we have the most significant upregulated genes, and at the bottom the most significant downregulated ones. After all, the most significant gene (by pval/padj) is not necessarily the most upregulated/downregulated one?

@biostatsquid 3 months ago

Hi! Great point. If you rank them by sign(log2FC) * -log10(p-val), it's exactly what you said: you'd be ranking them from most significant & upregulated > less significant upregulated > less significant downregulated > most significant downregulated. Does this make sense? And yes, exactly, the gene with the highest sign(log2FC) * -log10(p-val) may not be the most upregulated, but rather the most significant :)

@jayashreelaxmekuppuswami8600 1 year ago

How does KS test answer the question of whether the ranked list is random or not? Isn't that a test of normality of distribution? How can it inform us about randomness or non randomness of a ranked list?? Pls explain

@biostatsquid 1 year ago

Hi Jayashree, thank you for your question, I will elaborate a bit more than in the video. The KS test checks whether two samples follow the same distribution. It has many uses, for example, as you mention, to test for normality. In this case, however, we use it to check whether the genes from a certain pathway are distributed randomly across the ranked list or not. So for example, we check the distribution of genes related to 'ATP synthesis' in our ranked list (sorting genes from most to least upregulated). If most of the genes involved in ATP synthesis are upregulated in one condition, they will be located at the top of the list, so their distribution across our ranked list is clearly not random. Therefore, we conclude that ATP synthesis is a differential pathway between our two conditions. The KS test will sort out the statistics for us, giving us p-values to help us decide when a pathway is statistically significant for our comparison. Hope this was a bit clearer!
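
As a rough illustration of the idea in this reply, the sketch below computes a two-sample KS statistic in plain Python: it walks down the ranked list and tracks the largest gap between the cumulative fraction of pathway genes seen so far and the cumulative fraction of all other genes. The function name and the toy gene lists are assumptions for illustration, and GSEA itself uses a weighted, KS-like running sum rather than this plain statistic.

```python
def ks_statistic(ranked_genes, gene_set):
    """Largest gap between the empirical CDFs of pathway genes and of all
    other genes, measured along the ranked list."""
    n_in = sum(gene in gene_set for gene in ranked_genes)
    n_out = len(ranked_genes) - n_in
    seen_in = seen_out = 0
    max_gap = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            seen_in += 1
        else:
            seen_out += 1
        gap = abs(seen_in / n_in - seen_out / n_out)
        max_gap = max(max_gap, gap)
    return max_gap

ranked = [f"gene{i}" for i in range(100)]  # toy ranked list, top to bottom
top_heavy = set(ranked[:10])               # all pathway genes at the top
spread_out = set(ranked[4::10])            # pathway genes spaced evenly

print(ks_statistic(ranked, top_heavy))     # large gap: clearly not random
print(ks_statistic(ranked, spread_out))    # small gap: looks random
```

A pathway whose genes cluster at one end of the list gives a statistic close to 1, while evenly scattered genes give a value close to 0. Turning the statistic into a p-value is a separate step that this sketch leaves out.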

@jennyhu5011 8 months ago

what does the list of background genes do?

@biostatsquid 8 months ago

Hi Jenny, thanks so much for your question - I don't think I mentioned it in this video, so sorry for the confusion! In GSEA, we just need a list of all the genes we're interested in, and a list of gene sets. The background genes are used to filter out the genes that were not measured in our experiment from the gene sets, to avoid bias. E.g., if you download cancer hallmark gene sets, some pathways may contain genes that were not measured in your experiment for whatever reason (e.g., if you have liver samples, brain-related genes may be very downregulated or not expressed). So we must remove all those genes from the gene set list we use for our analysis. Hopefully this made sense! You can read more about it in my PEA blogpost/I think I also explain it in the PEA video:)
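
The filtering step described in this reply can be sketched in a few lines of Python. The pathway names, gene names, the `filter_gene_sets` helper, and its `min_size` parameter are all hypothetical; real tools may apply their own size limits.

```python
def filter_gene_sets(gene_sets, measured_genes, min_size=1):
    """Keep only the genes of each gene set that were measured in the
    experiment, and drop gene sets left with fewer than min_size genes."""
    background = set(measured_genes)
    filtered = {}
    for name, genes in gene_sets.items():
        kept = sorted(set(genes) & background)
        if len(kept) >= min_size:
            filtered[name] = kept
    return filtered

# Hypothetical gene sets and a hypothetical list of measured genes.
gene_sets = {
    "PATHWAY_X": ["GENE_1", "GENE_2", "GENE_3"],
    "PATHWAY_Y": ["GENE_4", "GENE_5"],
}
measured = ["GENE_1", "GENE_2", "GENE_6"]

filtered = filter_gene_sets(gene_sets, measured)
print(filtered)
```

Here PATHWAY_X keeps only the two genes that were measured, and PATHWAY_Y is dropped because none of its genes appear in the experiment.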