Тёмный

Remove Outliers - SQL Interview Query 11 | SQL Problem Level "MEDIUM" 

techTFQ
Подписаться 337 тыс.
Просмотров 9 тыс.
50% 1

Опубликовано:

 

29 окт 2024

Поделиться:

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист
Посмотреть позже
Комментарии : 74   
@7vensandy_Data_Analyst
@7vensandy_Data_Analyst 7 месяцев назад
Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.
@camomile4085
@camomile4085 4 месяца назад
Could you share the answer, please?
@13.AkashDas
@13.AkashDas 2 месяца назад
my approach : with cte as( select *,avg(rating)over(partition by hotel )as avg from hotel_ratings), cte2 as (select *, abs(avg-rating) as abs from cte), cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2) select hotel,year,rating from cte3 where excludeabs order by hotel desc
@arnabsarkar6159
@arnabsarkar6159 7 месяцев назад
Not a full proof solution as one partition might have more than one distinct outliers..Not sure how rank function will help there
@camomile4085
@camomile4085 4 месяца назад
Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance: WITH cte1 AS ( SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg FROM hotel_ratings), cte2 AS ( SELECT *, ABS (rating - r_avg) AS diff FROM cte1 ) SELECT hotel, year, rating FROM cte2 WHERE diff
@Ajaysaini081
@Ajaysaini081 7 месяцев назад
This is my alternate solution: select * from hotel_ratings where rating < (select avg(rating) from hotel_ratings where hotel='Radisson Blu') and hotel='Radisson Blu' union select * from hotel_ratings where rating > (select avg(rating) from hotel_ratings where hotel='InterContinental') and hotel='InterContinental' order by hotel desc
@NischitSapkota
@NischitSapkota 7 месяцев назад
For a large enough data set , there could be multiple values above the mean value. So, the data point above the mean doesn't necessarily mean the highest dataset. The average height of a male is 5'8 (let's say) , doesn't necessarily signify that 5'9 is the tallest a man could ever be. It could be 6'5 , 7'3 or whatever.
@Ajaysaini081
@Ajaysaini081 6 месяцев назад
Yes that is the issue with this solution
@splendidabhi
@splendidabhi 7 месяцев назад
Thank you so much learnt new things ❤
@parmoddhiman678
@parmoddhiman678 3 месяца назад
with cte as ( SELECT *, round(avg(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following ),2) as avg_rating FROM hotel), cte1 as (select *, abs(avg_rating-rating) as diif from cte), cte2 as (select *, dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc ) as rn from cte1) select hotel,year,rating from cte2 where rn>1
@gphanisrinivasful
@gphanisrinivasful 7 месяцев назад
This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause. select avg(rating) over(partition by hotel) from hotel_ratings This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.
@yogamayaa
@yogamayaa 7 месяцев назад
He put Order by within partition to just maintain same order of records as final.
@vaibhavbatra9118
@vaibhavbatra9118 5 месяцев назад
with cte as (select hotel , year, rating, average_rating, average_rating_difference, max(average_rating_difference) over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference from (select *, AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating, abs(rating - AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference from hotel_ratings)x ) select hotel, year, rating from cte where average_rating_difference max_difference
@RAISHABANU-d2k
@RAISHABANU-d2k 7 месяцев назад
sir, please put a video on different star pattern forming with mysql .
@anirbanbiswas7624
@anirbanbiswas7624 4 месяца назад
with cte as(select *,rank() over(order by rating asc)as max_rnk,rank() over(order by rating desc)as min_rnk from hotel_ratings) select hotel,year,rating from cte where max_rnk 8 and max_rnk1 and min_rnk 1 order by hotel donot know whether it will be dynamic or not...
@nileshpandey8247
@nileshpandey8247 7 месяцев назад
with cte as ( select *,avg(rating)over(partition by hotel ) as avg_rating,round(STDDEV(rating)over(partition by hotel),4) as std from hotel_ratings), cte1 as (select *,abs((rating - avg_rating)/std) as out_detect from cte) select hotel,year,rating from cte1 where out_detect < 1.5 order by hotel desc,year
@mayerKhadar
@mayerKhadar 7 месяцев назад
Perfect!
@TheRaju991
@TheRaju991 7 месяцев назад
why 1.5?
@nileshpandey8247
@nileshpandey8247 7 месяцев назад
@@TheRaju991 it's just bcz of this problem , else it would have been 3
@TheRaju991
@TheRaju991 7 месяцев назад
@@nileshpandey8247 Makes sense. Thanks!
@navaneeth6414
@navaneeth6414 7 месяцев назад
@TFQ Generalized solution that will work for many outliers WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS diff_rating , cast(avg_rating + std_dev as decimal(9,1)) as positive , cast(avg_rating - std_dev as decimal(9,1)) as negative from cte ),cte2 as ( Select *, CASE WHEN rating =negative THEN 1 else 0 end as flag from cte1 ) select hotel,year,rating from cte2 where flag 0 order by hotel desc, year asc
@gouthamstar6558
@gouthamstar6558 6 месяцев назад
too complex, you can simplify this code
@navaneeth6414
@navaneeth6414 6 месяцев назад
Ok let me know your solution
@AjinkyaGhadge-w5l
@AjinkyaGhadge-w5l 3 месяца назад
@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??
@DataSailor
@DataSailor 5 месяцев назад
SQL server - WITH CTE AS (SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings), CTE_FINAL AS (SELECT *,CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE) SELECT hotel,year,rating FROM CTE_FINAL WHERE flag=0
@ratheeshg1896
@ratheeshg1896 4 месяца назад
Oracle SQL | Solution that will work for many outliers. WITH cte AS ( SELECT hotel, year, rating, AVG(rating) OVER (PARTITION BY hotel) AS average, rating - AVG(rating) OVER (PARTITION BY hotel) AS difference FROM hotel_ratings ), result_table AS ( SELECT hotel, year, rating, difference, AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff, CASE WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel) THEN 'remove' ELSE 'keep' END AS flag FROM cte ) SELECT hotel, year, rating FROM result_table WHERE flag = 'keep' ORDER BY hotel DESC, year;
@gouthamstar6558
@gouthamstar6558 6 месяцев назад
with cte as( select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings ), cte2 as (select *, abs(rating-rating2) as did from cte), cte3 as (select *, rank() over(partition by hotel order by did desc) as a from cte2) select hotel,year,rating from cte3 where a>1 order by hotel desc ,year;
@vivekshirsat9249
@vivekshirsat9249 5 месяцев назад
WITH AVG_RATING_CTE AS (SELECT *, ROUND(AVG(RATING)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),1) AS AVG_RATING FROM HOTEL_RATINGS) ,DIFF_CTE AS(SELECT HOTEL, YEAR, RATING, AVG_RATING ,ABS(AVG_RATING - RATING) AS DIFF FROM AVG_RATING_CTE) ,MAX_CTE AS( SELECT HOTEL, YEAR, RATING, DIFF ,MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX FROM DIFF_CTE ) SELECT HOTEL, YEAR, RATING FROM MAX_CTE WHERE DIFF MAX ORDER BY HOTEL,YEAR
@abhijitvernekar593
@abhijitvernekar593 7 месяцев назад
with cte as( select *,avg(rating) over(partition by hotel ) avg_rating,abs(rating - avg(rating) over(partition by hotel )) diff_rating from hotel_ratings), cte2 as (SELECT *,max(diff_rating) over(partition by hotel) max_diff from cte ) select hotel,year,rating from cte2 where diff_ratingmax_diff order by hotel desc, year ;
@sat9sh
@sat9sh 7 месяцев назад
Thank You Sir!
@MdZeeshan-m9u
@MdZeeshan-m9u 7 месяцев назад
Thank you so much
@hassamulhaq7762
@hassamulhaq7762 7 месяцев назад
thank you,
@Tusharchitrakar
@Tusharchitrakar 7 месяцев назад
Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example: If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist. Right solution: Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers. If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate That being said thank you for your continuous efforts in helping us build strong SQL foundations.
@gunasekhar8440
@gunasekhar8440 7 месяцев назад
Hi Tushar ji, simply we have two types of famous methods in stats. 1. Z-score 2. IQR Correct me if I'm wrong
@Tusharchitrakar
@Tusharchitrakar 7 месяцев назад
​​@@gunasekhar8440you are absolutely right. Iqr rule is applied if the data is not gaussian/normal. Iqr method can also be applied here: we need to calculate the 25th (Q1) and 75th percentile (Q3) of the data (there are window functions for this which will be different in different rdbms i guess). IQR=75th percentile value - 25th percentile value (Q3-Q1). Then outliers will be either Q1-1.5*IQR or Q3+1.5*IQR. This logic can also be used in SQL to identify outliers. Hope this helps
@rakeshkumarreddymudda
@rakeshkumarreddymudda 7 месяцев назад
Yes I also felt the same. Approach is not correct, ranking will not suit for resolving the outliers issue. We need use IQR to find outliers I think
@gunasekhar8440
@gunasekhar8440 7 месяцев назад
@@rakeshkumarreddymudda but we have only less number of records. Then if we do that procedure it doesn't make any sense
@Tusharchitrakar
@Tusharchitrakar 7 месяцев назад
@@gunasekhar8440 yes but in a general case iqr rule has to be applied. In these case there are only 4-6 records per hotel so the concept of percentile wouldn't be accurate for such low data points. But if the number of records were more, it would so you are right
@anudeepreddy5559
@anudeepreddy5559 Месяц назад
🔥❤
@digitaltechconnect1318
@digitaltechconnect1318 7 месяцев назад
Instead of ranking, is there any other logic to find outlier step at the last as step?
@akashchristopher
@akashchristopher 7 месяцев назад
yes. you can also find the outlier using 5 number summary method.
@ErfanHossainShoaib
@ErfanHossainShoaib 7 месяцев назад
you can find max avg value for hotel and get less value of max avg value
@ongole_boy6346
@ongole_boy6346 7 месяцев назад
Bro , can you please explain in oracle
@dumdum987
@dumdum987 3 месяца назад
with cte as( select hotel,avg(rating) as rat from hotel_ratings group by hotel) ,cte2 as(select h.hotel,h.year,rating,rank() over (partition by hotel order by abs(rating -rat) desc) as r from cte c inner join hotel_ratings h on c.hotel=h.hotel) select * from cte2 where r>1;
@chetanmaurya8557
@chetanmaurya8557 6 месяцев назад
i am getting avg rating of radisson blu as 3.83
@iswillia123
@iswillia123 7 месяцев назад
select hotel,year, rating, delta, max_delta from ( select hotel,year, rating, delta, max(delta) over (partition by hotel ) max_delta from ( select hotel,year, rating, abs(rating-av) as delta from ( select hotel, year, rating, round(avg(rating) over (partition by hotel),2) av from hotel_ratings))) where delta max_delta order by hotel desc, year ;
@SanjaySharma-zf4qw
@SanjaySharma-zf4qw 6 месяцев назад
select hotel,year,rating from hotel_ratings where hotel = 'Radisson Blu' and year in('2021','2022','2023') union select hotel,year,rating from hotel_ratings where hotel = 'InterContinental' and year in('2020','2021','2023') order by hotel desc
@gouthamstar6558
@gouthamstar6558 6 месяцев назад
this is a static way, not applicable for different cases. but what if the ratings changes? will you change the year names again? just think
@akshaysalve195
@akshaysalve195 2 месяца назад
WITH cte AS ( SELECT hr.*, ( SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.Hotel = hr.Hotel ) AS avrg FROM hotel_ratings hr ), cte1 AS ( SELECT Hotel, Year, rating, RANK() OVER ( PARTITION BY Hotel ORDER BY ABS(rating - avrg), Year ) AS rn FROM cte ) SELECT Hotel, Year, rating FROM cte1 WHERE rn < ( SELECT MAX(rn) FROM cte1 ) ORDER BY Hotel DESC, Year;
@SylviaFerguson-u8g
@SylviaFerguson-u8g 2 месяца назад
Thompson Edward Johnson Gary Johnson Karen
@navaneeth6414
@navaneeth6414 7 месяцев назад
My solution: Correct me if am wrong WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS flag1 from cte ),cte2 as ( select *,max(flag1) over(partition by hotel order by hotel) as maxFlag from cte1 ) (select Hotel,year,rating from cte2 where flag1 maxFlag) ORDER BY hotel DESC
@sathvikdakappagari9122
@sathvikdakappagari9122 3 месяца назад
my solution : WITH rating_differences AS ( SELECT hotel,year,rating, ABS(rating - (SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.hotel = hr1.hotel)) AS rating_difference FROM hotel_ratings hr1), max_deviation AS ( SELECT hotel, MAX(rating_difference) AS max_rating_difference FROM rating_differences GROUP BY hotel) select fl.hotel, fl.year, fl.rating from(SELECT rd.hotel, rd.year, rd.rating, rd.rating_difference, CASE WHEN rd.rating_difference = md.max_rating_difference AND rd.hotel = md.hotel THEN 1 ELSE 0 END AS flag FROM rating_differences rd JOIN max_deviation md ON rd.hotel = md.hotel) fl where fl.flag != 1 thank you
Далее
Voy shetga man aralashay | Million jamoasi
00:56
Просмотров 248 тыс.
Ванька пошел!!!! 🥰
00:18
Просмотров 1,4 млн
6 SQL Joins you MUST know! (Animated + Practice)
9:47
Просмотров 182 тыс.
Voy shetga man aralashay | Million jamoasi
00:56
Просмотров 248 тыс.