Dealing with Missing Data in R

Data imputation is a technique for replacing missing values with substitute values while preserving the overall trends in the data. It can be done in a huge number of ways. In R there are many packages that make imputation easy, as long as you understand the method you want and why you are using it. In this video I want to showcase how you can use the mice package to easily replace missing data in a matrix and how you can compare the performance of each algorithm using ggplot2.
Slides
docs.google.com/presentation/...
Github
github.com/brandonyph/Imputat...
Email: liquidbrain.r@gmail.com
Website: www.liquidbrain.org/videos
Patreon: / liquidbrain
Chapters
0:00 Introduction
1:05 What's imputation
1:45 Types of missing data
3:22 Measuring success
3:55 A number of different imputation techniques
9:05 R Script: introduction of the rmd format
10:06 Mean Imputation
11:40 locf and nocb
14:36 kNN and kNN imputation
19:00 Advanced imputation with mice()
23:00 How did pmm and rf perform?
25:07 TCGA data Imputation
30:13 Effectiveness of Imputation
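
For the mice() chapters listed above, a minimal sketch of imputation with predictive mean matching (pmm) might look like the following. It assumes the mice package is installed and uses the nhanes example data bundled with mice, not the dataset from the video; the values of m and seed, and the choice to extract the first completed dataset, are illustrative assumptions.

```r
# Sketch only: assumes install.packages("mice") has been run.
library(mice)

# nhanes ships with mice: 25 rows, 4 numeric columns, several NAs.
missing_before <- colSums(is.na(nhanes))

# Build m = 5 imputed datasets with predictive mean matching.
# printFlag = FALSE suppresses the iteration log.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed dataset (mice::complete, not tidyr::complete).
completed <- mice::complete(imp, 1)
missing_after <- colSums(is.na(completed))

print(missing_before)
print(missing_after)
```

Swapping method = "pmm" for another method supported by mice is how different algorithms would be compared; some methods need additional packages to be installed.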

Published: 15 Jul 2024

Comments: 7

@mangalahegde3805 2 years ago

Woow.. This is wonderful.. Thank you for creating and sharing informative videos

@haraldurkarlsson1147 7 days ago

It would be nice to know where some of the functions you are using come from (without having to visit GitHub). I cannot find locf, nocb or forbak in nomemica. I checked the zoo package. It does not have those, but it has a similar one (na.locf, which covers both LOCF and NOCB).
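
On the question of where LOCF and NOCB come from: both can be written in a few lines of base R. The sketch below defines its own helpers; the names locf and nocb are chosen here for illustration and are not claimed to be the functions used in the video. With the zoo package, the equivalent calls would be na.locf(x, na.rm = FALSE) for LOCF and na.locf(x, na.rm = FALSE, fromLast = TRUE) for NOCB.

```r
# Last observation carried forward (LOCF), base R only.
# Leading NAs are left as NA because nothing precedes them.
locf <- function(x) {
  seen <- cumsum(!is.na(x))   # how many observed values so far
  obs  <- x[!is.na(x)]        # the observed values, in order
  out  <- x
  out[seen > 0] <- obs[seen[seen > 0]]
  out
}

# Next observation carried backward (NOCB): LOCF on the reversed vector.
# Trailing NAs are left as NA because nothing follows them.
nocb <- function(x) rev(locf(rev(x)))

x <- c(NA, 1, NA, NA, 4, NA)
print(locf(x))  # NA  1  1  1  4  4
print(nocb(x))  #  1  1  4  4  4 NA
```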

@Philantrope 4 months ago

Thanks for this thorough demonstration! I wonder what you think about what percentage of missing values is acceptable for imputation. The number of available complete cases might also be important. E.g. if I have 3,000 complete cases, is it okay to impute 12,000 missing values in the other cases? Information on these considerations is rarely to be found.

@haraldurkarlsson1147 7 days ago

Nice presentation. However, I find it difficult to find a good account of the difference between the classes of missingness (MCAR, MAR, MNAR). After reading the descriptions of these classes by different YouTubers I am just left at a loss. Perhaps no one can explain these things?
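
One way to make the three classes concrete is to simulate them. The sketch below uses made-up variables (age and income) and made-up missingness probabilities, all of which are assumptions for illustration. Under MCAR the chance that income is missing is the same for everyone; under MAR it depends only on the observed age; under MNAR it depends on the income value itself.

```r
set.seed(1)
n      <- 1000
age    <- rnorm(n, mean = 50, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, mean = 0, sd = 5)

# MCAR: each income value has the same 30% chance of being missing.
mcar <- runif(n) < 0.3

# MAR: the chance that income is missing depends only on observed age.
mar <- runif(n) < plogis((age - 50) / 5)

# MNAR: the chance that income is missing depends on income itself.
mnar <- runif(n) < plogis((income - 45) / 5)

# Mean of the income values that remain observed under each mechanism.
observed_mean <- function(miss) mean(income[!miss])

means <- c(
  full = mean(income),
  mcar = observed_mean(mcar),
  mar  = observed_mean(mar),
  mnar = observed_mean(mnar)
)
print(round(means, 2))
```

In this setup the complete-case mean stays close to the full mean under MCAR but is pulled downward under MAR and MNAR. The practical difference is that under MAR the bias can in principle be corrected using age, which is observed, whereas under MNAR the information needed for the correction is exactly what is missing.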

@warmtaylor 10 months ago

Thank you for your informative video! At 15:03, I was wondering if you could explain why data need to be normalised before applying kNN imputation. What would be the consequence(s) if the actual values were used for kNN imputation directly? Also, are there quantitative methods that could be used to assess the accuracy of the imputation, rather than visualisation? My data contains more than three thousand rows, so it is hard to assess the accuracy using the three types of plot described in the video.
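
A common quantitative check, sketched here under stated assumptions, is to hide some values that are actually known, impute them, and score the imputed values against the hidden truth with RMSE or MAE. The simulated data, the 20% masking rate, and the use of mean imputation as the method under test are all illustrative choices; any imputation method could be substituted at the marked step.

```r
set.seed(42)

# Simulated "complete" data standing in for a real, fully observed column.
x_true <- rnorm(200, mean = 10, sd = 2)

# Hide 20% of the known values.
mask   <- sample(seq_along(x_true), size = 40)
x_miss <- x_true
x_miss[mask] <- NA

# Imputation step under test (mean imputation here; swap in any method).
x_imp <- x_miss
x_imp[is.na(x_imp)] <- mean(x_miss, na.rm = TRUE)

# Score only the positions that were hidden.
err  <- x_imp[mask] - x_true[mask]
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))

print(c(rmse = rmse, mae = mae))
```

Repeating the masking with different seeds and comparing the scores across methods gives a numeric ranking that does not depend on inspecting plots.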

@haraldurkarlsson1147 7 days ago

I believe that if you have variables with different ranges (say 0 to 1 and 0 to 100), then you need to scale or normalize them before running kNN, or one variable might dominate the other.
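
The effect described in this reply can be seen with a small, made-up example. The data frame, the min-max helper, and the nearest helper below are all hypothetical and exist only for illustration. Variable a lies in [0, 1] and b in [0, 100]; on the raw values, the Euclidean distance used by kNN is driven almost entirely by b, so the nearest neighbour of row 1 changes once both variables are put on the same scale.

```r
df <- data.frame(
  a = c(0.0, 1.0, 0.1, 0.5),  # range 0 to 1
  b = c(0, 5, 20, 100)        # range 0 to 100
)

# Min-max scaling to [0, 1].
minmax <- function(v) (v - min(v)) / (max(v) - min(v))
df_scaled <- as.data.frame(lapply(df, minmax))

# Pairwise Euclidean distances before and after scaling.
d_raw    <- as.matrix(dist(df))
d_scaled <- as.matrix(dist(df_scaled))

# Index of the nearest other row to row i.
nearest <- function(d, i) {
  d[i, i] <- Inf              # exclude the row itself
  unname(which.min(d[i, ]))
}

print(nearest(d_raw, 1))     # 2: close in b, but at the opposite end of a
print(nearest(d_scaled, 1))  # 3: close in both variables once rescaled
```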