@13:11 Everything was going well; you even mentioned the security issue of a counter making the URL guessable, and then you said we add 10-15 random bits. First, we cannot add bits that make the number bigger, because Zookeeper might eventually reach that range itself when counting up from 1 million. Second, let's say we append random characters instead: then the base62 encoding won't be 7 chars long, and you have to take the first 7 chars, which WILL result in a collision sometime in the future.
@@ericfries7229 Could you elaborate? From what I understand, if we concatenate the number with the random bits, the resulting string would be more than 7 chars. Say the new length is 10: base62 encoding would give another 10-char string, which can still result in a collision, no?
If we add 10-15 bits at the end of the counter number, it will increase the base62 output size and exceed the 7-character limit. So can you clarify how that addition of a random string works?
TLDR: md5("123") vs md5("123,xyz"). The original solution hashed the returned counter value, which is a simple increasing number; the resulting hash is then base62 encoded. The new solution takes the counter number and appends some extra characters at the end; this new string is then hashed and base62 encoded. This way, the string fed into the hash is not guessable. It works because the numeric portion is still guaranteed not to collide.
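A rough sketch of that difference in Python (the comma separator, the 16-bit salt size, and taking the first 7 base62 chars are assumptions pieced together from this thread, not the video's exact scheme):

```python
import hashlib
import secrets

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    # Repeatedly divide by 62, collecting remainders as digits.
    s = ""
    while n:
        n, r = divmod(n, 62)
        s = ALPHABET[r] + s
    return s or "0"

def guessable_code(counter: int) -> str:
    # Original approach: hash the bare counter. Anyone who knows the
    # counter value can reproduce this code.
    digest = hashlib.md5(str(counter).encode()).hexdigest()
    return to_base62(int(digest, 16))[:7]

def salted_code(counter: int) -> str:
    # New approach: append random characters before hashing, so the
    # hash input can no longer be derived from the counter alone.
    salt = secrets.token_hex(2)  # 16 random bits (size is an assumption)
    digest = hashlib.md5(f"{counter},{salt}".encode()).hexdigest()
    return to_base62(int(digest, 16))[:7]
```

Note that taking the first 7 chars is exactly where the collision worry in this thread comes from; the sketch only illustrates the guessability fix.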
Even without appending extra digits, base62 does not guarantee 7 chars. That simply means we are not confining ourselves to 7 chars. We probably need to tell the interviewer that we would produce as small a hash as possible, but with no guarantees. This is a possible trade-off to avoid collisions.
is there a way to replace zookeeper? why not just store a record of used ranges in a db or redis cache? generating a new range only happens once in a while: when a new server is being initialized, and when a server's range runs out and it needs to fetch a new one.
LRU cache eviction has a big shortcoming. If we assume the cache is always at capacity, then with each new short-URL creation one of the top-20% URLs will surely get evicted from the cache to make room for a random unpopular URL someone has just submitted. This eviction approach means there will be a steady stream of URLs in the cache that are not popular at all.
does Base62 encoding of the counter guarantee to return only a 6- or 7-character string? If so, then Base62-encoding an MD5-hashed value should also return a 6- or 7-character string... please clear up this doubt
Believe me brother, none of the guys on YouTube/books/blogs explain this clearly. Base62 does not guarantee a result length; it generally grows with the input, though.
if we have to base62 the number, then why not precompute it and distribute it to servers, i.e. S1 will keep the base62 of 0-1M, S2 will keep it for 1M-2M, and so on.
Concerning the round-robin load balancing not taking server load into consideration, so that it might still forward a request to an overloaded or slow server: given how this approach works, if one server is overloaded then all the servers are probably overloaded, since the load is distributed equally. The exception is servers of unequal capacity, which raises the question of why you would spin up a lower-capacity server in the first place.
If you are using a unique counter, why do you still need MD5 and base62 encoding? And then you need additional complexity like integrating Zookeeper just to maintain counters.
Hi. Good video. I have a question though. If we base62-encode small numbers like 0, 1, 2..., we won't get a 7-char-long tiny URL, right? So how do we make sure all our short URLs are 7 chars long? Kindly clarify
@@p516entropy Hi. Thanks for responding. But now consider a bigger number, say 1000000: its base62 comes out as 11PVWGSpX6, which is more than 7 characters. I am still unable to understand how to fix the counter-based short URL to exactly 7 characters
@@ashwinnatty Oh sorry, I was in a hurry; now I understand. The base62 algorithm here takes the same approach as base2 (000, 001, 010, 011, 100, 101) or base16 (0000...FFFF), just with 62 symbols, left-padded with zeros to 7 characters: 0 → 0000000, 1 → 0000001, 2 → 0000002, 3 → 0000003, 4 → 0000004, ..., 59 → 000000X, 60 → 000000Y, 61 → 000000Z, 62 → 0000010, 63 → 0000011, 64 → 0000012, 65 → 0000013, ..., 56_800_235_583 → 0ZZZZZZ, 56_800_235_584 → 1000000, ..., 3_521_614_606_207 → ZZZZZZZ. And indeed, as you can see, there are no collisions.
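That padding scheme can be sketched as follows (the alphabet order 0-9, a-z, A-Z is assumed, which is what makes 59 → X, 60 → Y, 61 → Z line up with the examples above):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_7(n: int) -> str:
    # Plain base62 conversion, then left-pad with '0' to exactly 7 chars.
    s = ""
    while n:
        n, r = divmod(n, 62)
        s = ALPHABET[r] + s
    return (s or "0").rjust(7, "0")

print(base62_7(0))            # 0000000
print(base62_7(61))           # 000000Z
print(base62_7(62))           # 0000010
print(base62_7(62**7 - 1))    # ZZZZZZZ, the largest 7-char value
```

Since every counter value below 62^7 maps to a distinct 7-character string, there are no collisions in that range.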
@@p516entropy sorry, still don't understand this. base62(101020)=FMA4abRo and base62(101021)=FMA4abRp. If we take the first 7 chars from each, we get a collision, "FMA4abR"
Why not just rely on a database cluster to generate a sequential numeric id, and just persist the id and the long url? So when a request comes in, we convert from base62 back to decimal and find the record in the db. What is wrong with this idea?
You mentioned Zookeeper. What if the number of servers is not stable: it could be 8 today, then 5 tomorrow, then 10 the day after. How does Zookeeper manage those ranges based on the number of servers available? I really need to understand this. Please help
why can't the database (mongodb) itself generate an auto-increment ID, and then the server can generate the MD5 or base62 after adding a salt to it? Since the db would never fail to generate a unique primary key on creation, it would avoid a single point of failure. Building a distributed solution just for generating IDs seems like overkill here... I am new to this, so let me know if there is a problem with this solution.
Thanks for the excellent video. A question: in solution 2 with Zookeeper, how does each server record the range it was given and detect when the range is exhausted? Is there a counter on each server?
That logic can reside on the zookeeper service itself. For instance, the coordination service can use a custom hashing function based on the incoming request and forward it to specific servers depending on the range value.
For caching, would it be better to have some kind of background job that populates the cache with the most popular URLs from the database? Otherwise, if you're always adding each new URL to the cache, it's no longer just the popular URLs but all of them (or at least as many as the cache can hold).
How does your hashing work? You take a 128-bit MD5 and then base62-encode it into a 20+ character string. How are you sure there are no collisions when you take only a subset of that base62 string (which is just the MD5 hash displayed differently)?
question about using counters: doesn't that mean that if the same user asks to shorten the same URL multiple times, they will get multiple short URLs? Is that OK from a requirements standpoint?
Whenever you get a request to shorten a URL, the first check should be whether it already exists in the cache or the main DB. Only if it doesn't should you proceed with short-URL creation. Or you could use the INSERT ... IF NOT EXISTS feature provided by a NoSQL solution like Cassandra.
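A minimal sketch of that check, with a plain dict standing in for the cache/DB lookup (the names and the code generator here are made up for illustration):

```python
# long_url -> short_code; stands in for the cache + main DB.
store: dict = {}

def shorten(long_url: str, new_code) -> str:
    # First check whether we already shortened this URL.
    existing = store.get(long_url)
    if existing is not None:
        return existing
    code = new_code()
    # In Cassandra this check-and-insert would instead be a single
    # atomic INSERT ... IF NOT EXISTS statement.
    store[long_url] = code
    return code

codes = iter(["aB3xY9z", "Q7wE2rT"])
print(shorten("https://example.com/a", lambda: next(codes)))  # aB3xY9z
print(shorten("https://example.com/a", lambda: next(codes)))  # aB3xY9z again
```

The second call returns the stored code rather than minting a new one, which is exactly the dedup behavior being described.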
The counter example is incorrect. Another solution could be to append the user ID (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask them to choose a uniqueness key. Even then, if we hit a conflict, we have to keep generating keys until we get a unique one.
it isn't incorrect, because the counter itself will be unique every time even if the random bits are the same, and so the generated hash will be unique for every url.
With a number less than ~3.5 trillion, base62 would always produce a unique string of at most 7 chars, but the issue here is that it would be guessable: if 1,000,000 → xyz001 then 1,000,001 → xyz002. To tackle that, we could add random characters to the number, but that results in a base62 string with length > 7, and now that is an issue.
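The 7-char bound in that ~3.5 trillion figure can be checked directly:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def b62(n: int) -> str:
    # Standard base62 conversion via repeated division.
    s = ""
    while n:
        n, r = divmod(n, 62)
        s = ALPHABET[r] + s
    return s or "0"

print(62**7)                # 3521614606208, just over 3.5 trillion
print(len(b62(62**7 - 1)))  # 7: the largest value that still fits
print(len(b62(62**7)))      # 8: one more and we exceed 7 chars
```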
Zookeeper provides increased availability through redundancy by operating in a multi-node cluster configuration; multiple Zookeeper nodes would have to go down to cause a service interruption. Since the URL-shortening service reserves a range of values rather than a single value, the number of round trips to this counter-reservation system is also reduced, which in turn reduces system load, network traffic, and the number of services involved in handling each reservation. All of this increases availability, and can also improve performance and reduce running costs. You can increase availability further by spanning your nodes and services across separate datacenters, so your infrastructure doesn't become a single point of failure.
Well, the lecture is great, but I have a doubt: MD5 generates a unique hash each time, so we can grab the first 7 chars of that hash. Why do we then need to convert it to base62?
ok, got it. MD5 generates a 128-bit output which is generally represented in hexadecimal; if you take the first 7 chars of that hex string, you can only make 16^7 combinations. However, if you use a base62-encoded string, you can fit many more combinations into 7 chars, i.e. 62^7.
Let's say the requirements change and we want a unique_url every time the same long_url is entered. (Use Case: you want to conduct analytics on the different entry points to the long_url site). How would we accommodate unique short_urls for each successive insert of the same long_url?
This design already allows us to associate the same long URL with one or more short URLs. To prevent the same long URL being associated with multiple short URLs, the service would first have to check if the long URL is already in use and, if so, return the existing short URL, which isn't something described in the requirements or the implementation detail.
base62 encoding can result in a string of any length. And we are not supposed to take the first 7 chars, to avoid collisions. So that means we take whatever output we get from base62. This is what I want to hear from these videos, but believe me, none of them emphasize this. They just say blah blah and use base62.
We are not encoding the original URL, because as you said that could result in a string of any length. Instead we are encoding the counter numbers (0 to ~3.5 trillion), which would not exceed 7 chars because 62^7 > 3.5 trillion.
Base62 should be considerably shorter than base 10, and the value you describe will come to about 7 characters. Base64 is shorter still, but its extra two characters might not be URL-safe! You may be converting the digit string itself (i.e. its bytes) to base62 rather than the number it represents, which would produce a longer result, since you would be starting from a much larger input.