Since more than one outlier can mess up the average value. Would suggest to use median value in outlier detection and removal. Using a z Score method or a percentile method would be appropriate.
my approach : with cte as( select *,avg(rating)over(partition by hotel )as avg from hotel_ratings), cte2 as (select *, abs(avg-rating) as abs from cte), cte3 as (select *, max(abs) over(partition by hotel) as exclude from cte2) select hotel,year,rating from cte3 where excludeabs order by hotel desc
Thank you for your instruction. However, I noticed some redundancy in the query. Here is my refined solution based on your guidance: WITH cte1 AS ( SELECT *, AVG(rating) OVER (PARTITION BY hotel) AS r_avg FROM hotel_ratings), cte2 AS ( SELECT *, ABS (rating - r_avg) AS diff FROM cte1 ) SELECT hotel, year, rating FROM cte2 WHERE diff
This is my alternate solution: select * from hotel_ratings where rating < (select avg(rating) from hotel_ratings where hotel='Radisson Blu') and hotel='Radisson Blu' union select * from hotel_ratings where rating > (select avg(rating) from hotel_ratings where hotel='InterContinental') and hotel='InterContinental' order by hotel desc
For a large enough data set , there could be multiple values above the mean value. So, the data point above the mean doesn't necessarily mean the highest dataset. The average height of a male is 5'8 (let's say) , doesn't necessarily signify that 5'9 is the tallest a man could ever be. It could be 6'5 , 7'3 or whatever.
with cte as ( SELECT *, round(avg(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following ),2) as avg_rating FROM hotel), cte1 as (select *, abs(avg_rating-rating) as diif from cte), cte2 as (select *, dense_rank() over(partition by hotel order by abs(avg_rating-rating) desc ) as rn from cte1) select hotel,year,rating from cte2 where rn>1
This might be applicable only for MS SQL, (I am not sure about PostgreSQL) a constant average can be obtained simply by removing the order by clause. select avg(rating) over(partition by hotel) from hotel_ratings This code gives the required output. Though, it helps that avg() does not require an order by clause mandatorily. This wouldn't work for fuctions that require order by, like rank(), etc.
with cte as (select hotel , year, rating, average_rating, average_rating_difference, max(average_rating_difference) over(partition by hotel order by year range between unbounded preceding and unbounded following) as max_difference from (select *, AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following) as average_rating, abs(rating - AVG(rating) over(partition by hotel order by year range between unbounded preceding and unbounded following)) as average_rating_difference from hotel_ratings)x ) select hotel, year, rating from cte where average_rating_difference max_difference
with cte as(select *,rank() over(order by rating asc)as max_rnk,rank() over(order by rating desc)as min_rnk from hotel_ratings) select hotel,year,rating from cte where max_rnk 8 and max_rnk1 and min_rnk 1 order by hotel donot know whether it will be dynamic or not...
with cte as ( select *,avg(rating)over(partition by hotel ) as avg_rating,round(STDDEV(rating)over(partition by hotel),4) as std from hotel_ratings), cte1 as (select *,abs((rating - avg_rating)/std) as out_detect from cte) select hotel,year,rating from cte1 where out_detect < 1.5 order by hotel desc,year
@TFQ Generalized solution that will work for many outliers WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating, STDEV(rating) over(partition by hotel order by hotel) as std_dev FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS diff_rating , cast(avg_rating + std_dev as decimal(9,1)) as positive , cast(avg_rating - std_dev as decimal(9,1)) as negative from cte ),cte2 as ( Select *, CASE WHEN rating =negative THEN 1 else 0 end as flag from cte1 ) select hotel,year,rating from cte2 where flag 0 order by hotel desc, year asc
@techTFQ I also face the same problme of range. However when I tried to order by hotel then it gave me the result as expected why does orderng by year create so much of difference ??
SQL server - WITH CTE AS (SELECT *, ABS(rating - AVG(rating) OVER (PARTITION BY hotel)) as average_difference FROM hotel_ratings), CTE_FINAL AS (SELECT *,CASE WHEN average_difference = MAX(average_difference) OVER (PARTITION BY hotel) THEN 1 ELSE 0 END AS flag FROM CTE) SELECT hotel,year,rating FROM CTE_FINAL WHERE flag=0
Oracle SQL | Solution that will work for many outliers. WITH cte AS ( SELECT hotel, year, rating, AVG(rating) OVER (PARTITION BY hotel) AS average, rating - AVG(rating) OVER (PARTITION BY hotel) AS difference FROM hotel_ratings ), result_table AS ( SELECT hotel, year, rating, difference, AVG(ABS(difference)) OVER (PARTITION BY hotel) AS avg_abs_diff, CASE WHEN ABS(difference) - AVG(ABS(difference)) OVER (PARTITION BY hotel) >= AVG(ABS(difference)) OVER (PARTITION BY hotel) THEN 'remove' ELSE 'keep' END AS flag FROM cte ) SELECT hotel, year, rating FROM result_table WHERE flag = 'keep' ORDER BY hotel DESC, year;
with cte as( select *, avg(rating) over(order by hotel) as rating2 from hotel_ratings ), cte2 as (select *, abs(rating-rating2) as did from cte), cte3 as (select *, rank() over(partition by hotel order by did desc) as a from cte2) select hotel,year,rating from cte3 where a>1 order by hotel desc ,year;
WITH AVG_RATING_CTE AS (SELECT *, ROUND(AVG(RATING)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),1) AS AVG_RATING FROM HOTEL_RATINGS) ,DIFF_CTE AS(SELECT HOTEL, YEAR, RATING, AVG_RATING ,ABS(AVG_RATING - RATING) AS DIFF FROM AVG_RATING_CTE) ,MAX_CTE AS( SELECT HOTEL, YEAR, RATING, DIFF ,MAX(DIFF)OVER(PARTITION BY HOTEL ORDER BY YEAR RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MAX FROM DIFF_CTE ) SELECT HOTEL, YEAR, RATING FROM MAX_CTE WHERE DIFF MAX ORDER BY HOTEL,YEAR
with cte as( select *,avg(rating) over(partition by hotel ) avg_rating,abs(rating - avg(rating) over(partition by hotel )) diff_rating from hotel_ratings), cte2 as (SELECT *,max(diff_rating) over(partition by hotel) max_diff from cte ) select hotel,year,rating from cte2 where diff_ratingmax_diff order by hotel desc, year ;
Sorry to say but your approach to this problem is naive and as a data analyst/scientist you need to make sure your solution is statistically accurate. I'll tell you a major flaw with your solution with an example: If you have multiple outliers in each hotel your solution will not identify those outliers since you are only identifying 1 extreme outlier with the maximum deviation from the mean but this does not mean that other outliers cannot exist. Right solution: Apart from mean also calculate the standard deviation for each hotel (every rdbms has a function for that else can be done manually too). In statistics there is something called the 3 Sigma rule (if an assumption can be made that the data distribution is approximately gaussian and this is also called the Z score technique) which needs to be applied here. Essentially, the outliers will be those records whose values are mean (mu) +/- 3* standard deviation (Sigma). This logic has to be applied to obtain the outliers. If your goal is to just apply SQL logic with window functions it is okay but since you are talking about a business case scenario it is important to be statistically accurate That being said thank you for your continuous efforts in helping us build strong SQL foundations.
@@gunasekhar8440you are absolutely right. Iqr rule is applied if the data is not gaussian/normal. Iqr method can also be applied here: we need to calculate the 25th (Q1) and 75th percentile (Q3) of the data (there are window functions for this which will be different in different rdbms i guess). IQR=75th percentile value - 25th percentile value (Q3-Q1). Then outliers will be either Q1-1.5*IQR or Q3+1.5*IQR. This logic can also be used in SQL to identify outliers. Hope this helps
@@gunasekhar8440 yes but in a general case iqr rule has to be applied. In these case there are only 4-6 records per hotel so the concept of percentile wouldn't be accurate for such low data points. But if the number of records were more, it would so you are right
with cte as( select hotel,avg(rating) as rat from hotel_ratings group by hotel) ,cte2 as(select h.hotel,h.year,rating,rank() over (partition by hotel order by abs(rating -rat) desc) as r from cte c inner join hotel_ratings h on c.hotel=h.hotel) select * from cte2 where r>1;
select hotel,year, rating, delta, max_delta from ( select hotel,year, rating, delta, max(delta) over (partition by hotel ) max_delta from ( select hotel,year, rating, abs(rating-av) as delta from ( select hotel, year, rating, round(avg(rating) over (partition by hotel),2) av from hotel_ratings))) where delta max_delta order by hotel desc, year ;
select hotel,year,rating from hotel_ratings where hotel = 'Radisson Blu' and year in('2021','2022','2023') union select hotel,year,rating from hotel_ratings where hotel = 'InterContinental' and year in('2020','2021','2023') order by hotel desc
WITH cte AS ( SELECT hr.*, ( SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.Hotel = hr.Hotel ) AS avrg FROM hotel_ratings hr ), cte1 AS ( SELECT Hotel, Year, rating, RANK() OVER ( PARTITION BY Hotel ORDER BY ABS(rating - avrg), Year ) AS rn FROM cte ) SELECT Hotel, Year, rating FROM cte1 WHERE rn < ( SELECT MAX(rn) FROM cte1 ) ORDER BY Hotel DESC, Year;
My solution: Correct me if am wrong WITH cte AS ( SELECT hotel,year,rating, CAST(AVG(rating) OVER(PARTITION BY hotel ORDER BY hotel) AS DECIMAL(9,1)) AS avg_rating FROM hotel_ratings ), cte1 AS ( SELECT * ,ABS(rating - avg_rating) AS flag1 from cte ),cte2 as ( select *,max(flag1) over(partition by hotel order by hotel) as maxFlag from cte1 ) (select Hotel,year,rating from cte2 where flag1 maxFlag) ORDER BY hotel DESC
my solution : WITH rating_differences AS ( SELECT hotel,year,rating, ABS(rating - (SELECT ROUND(AVG(rating), 2) FROM hotel_ratings hr2 WHERE hr2.hotel = hr1.hotel)) AS rating_difference FROM hotel_ratings hr1), max_deviation AS ( SELECT hotel, MAX(rating_difference) AS max_rating_difference FROM rating_differences GROUP BY hotel) select fl.hotel, fl.year, fl.rating from(SELECT rd.hotel, rd.year, rd.rating, rd.rating_difference, CASE WHEN rd.rating_difference = md.max_rating_difference AND rd.hotel = md.hotel THEN 1 ELSE 0 END AS flag FROM rating_differences rd JOIN max_deviation md ON rd.hotel = md.hotel) fl where fl.flag != 1 thank you