DWS is a leading publicly listed Australian IT Services company, providing services to blue chip organisations since 1991. With a business philosophy based upon integrity, reliability and professional service delivery, DWS provides end to end IT solutions.
Great presentation. For the ragged hierarchy, an example of salespeople and managers would be highly relatable: sales are attributed to a salesperson, many salespeople report to a specific manager, and managers in turn report to a senior manager.
This is a good video on Dimensional Modeling based on a case study - highly recommended ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-7HyGM3Iw0Kc.html
This is an absolute goldmine if one wants to get the basics of dimensional modelling. Time really well spent. Thank you DWS and Ross for publishing this!
I have one doubt: on the B-Tree slide, how do 400 entries fit in one 8 KB block if you are creating an index on an 8-byte field? By my calculation it would be 8000/8 = 1000 entries. Does that make sense? It would help if anyone could clarify.
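One possible reading of that arithmetic, offered as a rough sketch rather than the presenter's answer: a B-tree index entry holds more than the key itself. It also carries a pointer back to the row, plus some per-entry bookkeeping, and a block is not normally filled to 100%. The 8 KB block, the 8-byte key and the figure of 400 come from the question; the pointer size, overhead and fill percentage below are illustrative assumptions, not documented values for any particular database.

```python
# Back-of-envelope check of index entries per block.
BLOCK_BYTES = 8 * 1024      # 8 KB block, as in the question
KEY_BYTES = 8               # width of the indexed field, as in the question
ENTRIES_ON_SLIDE = 400      # figure quoted from the slide

# Naive estimate: the block holds nothing but keys.
naive_entries = BLOCK_BYTES // KEY_BYTES

# Bytes per entry implied by the slide's figure.
implied_bytes_per_entry = BLOCK_BYTES / ENTRIES_ON_SLIDE

# ASSUMPTIONS (illustrative only): each entry also stores a row pointer
# and a few bytes of bookkeeping, and part of the block is kept free.
ROW_POINTER_BYTES = 6
PER_ENTRY_OVERHEAD_BYTES = 4
USABLE_PERCENT = 90

entry_bytes = KEY_BYTES + ROW_POINTER_BYTES + PER_ENTRY_OVERHEAD_BYTES
usable_bytes = BLOCK_BYTES * USABLE_PERCENT // 100
estimated_entries = usable_bytes // entry_bytes

print("naive entries per block:     ", naive_entries)
print("implied bytes per entry:     ", implied_bytes_per_entry)
print("estimated entries per block: ", estimated_entries)
```

Under these assumed sizes the estimate lands close to the slide's 400, which suggests the slide was counting whole index entries rather than bare keys.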
Actually, a DW is the end product of a data journey. What organizations, institutions and companies are now implementing is: 1) Source to Data Lakes (exact copies of the source), 2) Data Lakes to Staging (prepare the data), 3) Staging to DW.
In example 5:
select c.customer_name, c.customer_address, i.invoice_num, i.invoice_date, i.invoice_amount
from Customer c,   -- 100 k rows
     Invoice i     -- 10 M rows
where c.cust_id = i.cust_id
  AND c.customer_name = ' Tarun Kumar'   -- exactly one row
  AND i.invoice_date >= (sysdate-12)     -- 20 % of all rows
If we are scanning more than 20% of the rows, should we add an index on the invoice date? Doesn't that violate the rule of thumb?
Most RDBMSs actually use B+trees as the data structure for an index. So saying an index is a B-tree is an over-simplification; a plain B-tree is generally not used as the data structure, and maybe the presenter just said it as an example. If your interviewer asked specifically about a "B-Tree", then shame on him for acting so unprofessionally! Otherwise, he might have expected you to correct him or to know about B+trees. Anyway, I hope you found this information useful and hope you got the next great job!
Conformed dimensions are dimension tables that combine two fact tables, or provide access to two fact tables through them, but this was explained in a very roundabout way and it was not clear.
Do you mean, "can we *safely* filter on ...?" And by *safely*, I mean without risk of picking up duplicate rows in the fact. Yes, you can, but it doesn't make much sense. The purpose of using the ragged hierarchy bridge (in my example, the CUSTOMER_BRIDGE) was to find facts that belong to children (or grandchildren, etc.) of the chosen customer(s).
By filtering CHILD_CUSTOMER_KEY = PARENT_CUSTOMER_KEY, we get the subset of the bridge that includes only level-0 relations (i.e. level-2 = self, children and grandchildren; level-1 = self and child; level-0 = self only).
The LEVELS_FROM_PARENT attribute provides this same functionality (LEVELS_FROM_PARENT = 0) in a slightly more intuitive fashion, and it also extends to deeper levels (1, 2, etc.).
But even THAT is not the best solution. If you're not interested in picking up facts for children and grandchildren etc., then don't use the bridge table at all - just join the fact to the CUSTOMER dimension.
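The filtering options described in this reply can be sketched with a tiny in-memory example. This is a minimal illustration, not the presenter's schema: CUSTOMER_BRIDGE, PARENT_CUSTOMER_KEY, CHILD_CUSTOMER_KEY and LEVELS_FROM_PARENT come from the reply, while the SALES_FACT table, its columns, the three-customer hierarchy, the amounts and the helper functions are all hypothetical. For brevity the "no bridge" case filters the fact directly instead of joining through a CUSTOMER dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_BRIDGE (
    PARENT_CUSTOMER_KEY INTEGER,
    CHILD_CUSTOMER_KEY  INTEGER,
    LEVELS_FROM_PARENT  INTEGER
);
-- Hypothetical fact table, for illustration only.
CREATE TABLE SALES_FACT (CUSTOMER_KEY INTEGER, AMOUNT INTEGER);

-- Hierarchy: customer 1 is the parent of 2, and 2 is the parent of 3.
-- The bridge holds one row per (ancestor, descendant) pair, including self.
INSERT INTO CUSTOMER_BRIDGE VALUES
    (1, 1, 0), (1, 2, 1), (1, 3, 2),
    (2, 2, 0), (2, 3, 1),
    (3, 3, 0);
INSERT INTO SALES_FACT VALUES (1, 100), (2, 20), (3, 3);
""")

def total_via_bridge(parent_key, max_levels=None):
    """Sum facts for a customer and its descendants, optionally depth-limited."""
    sql = """
        SELECT COALESCE(SUM(f.AMOUNT), 0)
        FROM CUSTOMER_BRIDGE b
        JOIN SALES_FACT f ON f.CUSTOMER_KEY = b.CHILD_CUSTOMER_KEY
        WHERE b.PARENT_CUSTOMER_KEY = ?
    """
    params = [parent_key]
    if max_levels is not None:
        sql += " AND b.LEVELS_FROM_PARENT <= ?"
        params.append(max_levels)
    return conn.execute(sql, params).fetchone()[0]

def total_direct(customer_key):
    """Sum facts for one customer only, without touching the bridge."""
    sql = "SELECT COALESCE(SUM(AMOUNT), 0) FROM SALES_FACT WHERE CUSTOMER_KEY = ?"
    return conn.execute(sql, (customer_key,)).fetchone()[0]

whole_tree = total_via_bridge(1)                 # self, child and grandchild
self_and_children = total_via_bridge(1, max_levels=1)
self_only_bridge = total_via_bridge(1, max_levels=0)
self_only_direct = total_direct(1)
print(whole_tree, self_and_children, self_only_bridge, self_only_direct)
```

In this toy data the level-0 bridge filter and the direct query return the same total, which is the point made above: if descendants are not wanted, the bridge adds a join without adding information.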
Hi Ross, just thinking: in your treatment-diagnosis example, during the ETL load can't I group the rows to have comma-separated diagnoses, which then simplifies the architecture so that there is no need for any bridge table? So you have patient joined to the treatment fact on patient key, with the diagnoses appearing in the treatment fact as comma-separated values? We also eliminate the exercise of "constructing" possible "groups of diagnoses", which is closer to the real world. With this architecture we also have no need for a weighting factor. Your thoughts, please.
Hi AT. Thanks for the question. Yes, you can indeed have a column containing repeating elements, either comma-separated, XML, or some other well-formed scheme for storing such data. But you've only solved one part of the problem - storage. What about querying? After all, that's why we're here.
For your comma-separated example, when you want to search for cases with diagnoses including FRACTURE - TIBIA, you can't query on DIAGNOSIS = 'FRACTURE - TIBIA'; you need to include wildcards, e.g. DIAGNOSIS LIKE '%FRACTURE - TIBIA%'.
This opens up a new world of problems, like how we treat strings that are a subset of other valid strings. For example, if our medical records system used particularly archaic language, we might record malaria as AGUE, but then DIAGNOSIS LIKE '%AGUE%' might well serve up results for PLAGUE, which would be undesirable.
Bridge tables resolve these problems with predictable and intuitive results.
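The AGUE / PLAGUE problem can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the table names TREATMENT_CSV, TREATMENT_FACT and DIAGNOSIS_BRIDGE, the DIAGNOSIS_GROUP_KEY column and the sample rows are hypothetical, and the bridge is simplified to hold the diagnosis name directly rather than a key into a separate diagnosis dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option A (hypothetical): diagnoses stored as one comma-separated column.
CREATE TABLE TREATMENT_CSV (TREATMENT_ID INTEGER, DIAGNOSIS TEXT);
INSERT INTO TREATMENT_CSV VALUES
    (1, 'AGUE,FRACTURE - TIBIA'),
    (2, 'PLAGUE'),
    (3, 'FRACTURE - TIBIA');

-- Option B (hypothetical, simplified): fact row points at a diagnosis
-- group, and the bridge lists one row per diagnosis in that group.
CREATE TABLE TREATMENT_FACT (TREATMENT_ID INTEGER, DIAGNOSIS_GROUP_KEY INTEGER);
CREATE TABLE DIAGNOSIS_BRIDGE (DIAGNOSIS_GROUP_KEY INTEGER, DIAGNOSIS TEXT);
INSERT INTO TREATMENT_FACT VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO DIAGNOSIS_BRIDGE VALUES
    (10, 'AGUE'), (10, 'FRACTURE - TIBIA'),
    (20, 'PLAGUE'),
    (30, 'FRACTURE - TIBIA');
""")

# Wildcard search on the comma-separated column: also matches PLAGUE.
like_hits = [row[0] for row in conn.execute("""
    SELECT TREATMENT_ID FROM TREATMENT_CSV
    WHERE DIAGNOSIS LIKE '%AGUE%'
    ORDER BY TREATMENT_ID
""")]

# Equality search through the bridge: matches only the exact diagnosis.
bridge_hits = [row[0] for row in conn.execute("""
    SELECT f.TREATMENT_ID
    FROM TREATMENT_FACT f
    JOIN DIAGNOSIS_BRIDGE b ON b.DIAGNOSIS_GROUP_KEY = f.DIAGNOSIS_GROUP_KEY
    WHERE b.DIAGNOSIS = 'AGUE'
    ORDER BY f.TREATMENT_ID
""")]

print("LIKE '%AGUE%' matches treatments:", like_hits)
print("bridge equality matches treatments:", bridge_hits)
```

With this sample data the wildcard query returns the PLAGUE treatment as well as the AGUE one, while the bridge query returns only the treatment that actually includes AGUE.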
In the Q&A you mentioned network latency as a reason for performance degradation when reading from disk, because the data has to go back and forth. But even when the data is read from the buffer cache, the buffer cache is in the server's memory, so there is still a round trip (network latency). So doesn't it just come down to quick access from memory versus relatively slow access from disk, since there is network latency in both? Thanks
Hi MGM L. I'm not sure I understand your question. Are you saying the cost of a round trip to memory is the same as network latency? I'm no hardware geek, so I'm prepared to be shouted down on this, but memory (RAM) is resident on the machine and directly accessible by the CPU. Picking something out of RAM is vanishingly fast.
Disk systems CAN be resident on the machine (like the SSD in my laptop), but are not directly accessible by the CPU. It needs to send a message to the disk controller to find and read data. That round trip is called (at least I call it) Disk Latency, and it is fast, but nothing like as fast as reading from memory.
Enterprise databases don't exist on my laptop. More often than not, their disk storage is NOT resident on the machine - it's on a whole nuther machine. When the database wants something off that disk system, it needs to send a message across the network to that machine, and THAT'S Network Latency - it takes time.
I want to learn your way. At the end you said there would be more presentations in future, and I hope you have given them, but sadly I was not able to find the links on RU-vid. Can you help me find them? Thanks in advance.
Ross ... It's a pleasure watching your presentations on Dimensional Modeling. (I watched all 3 back to back.) Having gone through the 3rd edition of the Toolkit, I see that Ralph strongly opposes the urge to normalize and believes in the dimensional approach to modeling. In my opinion, though, SCD4/SCD5 are definitely an alternative form of normalization for tackling monster dimensions. Similarly, junk dimensions and outriggers are concepts that indeed support normalization. I feel both Bill Inmon's and Ralph Kimball's approaches are widely used. I am fascinated by Dimensional Modeling; I have been involved in projects using the normalized approach and would love to make my foundation in Dimensional Modeling more solid. Thanks for all the insightful video tutorials. I enjoyed and relished every part of the session. Thanks for providing it to the larger audience via RU-vid.
A very elegant presentation. The exercise was perfect, and I feel the guy representing the Buffer Cache will definitely remember how an Oracle index works for a long, long time.
I am learning data warehouse concepts and your video is a really nice one to start with, thanks for it! It was a long video and I took 2 days to watch it :), but I loved the kitchen. Please keep posting, it's nice to listen to you!