
Why GPUs Outpace CPUs? 

DigitalSreeni

A Deep Dive into Why GPUs Outpace CPUs - A Hands-On Tutorial
FLOPS stands for "Floating Point Operations Per Second": the number of floating-point calculations a computer system can perform in one second. The higher the FLOPS value, the faster the processor can perform floating-point calculations, indicating better computational performance.
FLOPS is commonly used to quantify the computational power of processors and other computing devices. It is an important metric for tasks that involve complex mathematical calculations, such as scientific simulations, artificial intelligence and machine learning algorithms.
In this tutorial, let us use FLOPS as a metric to compare the performance of the CPU and the GPU. We will begin with the DAXPY (Double-precision A*X plus Y) operation, commonly used in numerical computing: multiply a scalar (A) by a vector (X) and add the result to another vector (Y). We will measure the FLOPS achieved for the DAXPY operation on the CPU and the GPU, respectively.
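As a rough sketch of the idea (the array size and the 2-FLOPs-per-element accounting below are assumptions for illustration, not necessarily the exact code from the video):

import time
import numpy as np

N = 100_000_000               # assumed array size
A = 2.5                       # the scalar
X = np.random.rand(N)         # float64 (double precision) by default
Y = np.random.rand(N)

start = time.perf_counter()
Z = A * X + Y                 # DAXPY: one multiply and one add per element
elapsed = time.perf_counter() - start

# Each element costs 2 floating-point operations (multiply + add), so a
# rough FLOPS estimate is 2 * N / elapsed. (Code that divides just the
# array size N by the time gives a number proportional to this.)
flops = 2 * N / elapsed
print(f"Elapsed: {elapsed:.3f} s, estimated GFLOPS: {flops / 1e9:.2f}")

A GPU version of the same operation (e.g., with CuPy) would follow the same pattern, with the arrays allocated on the device.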
The DAXPY operation is executed using NumPy operations (A * X + Y). NumPy can leverage optimized implementations, and the actual computation may occur in optimized C or Fortran libraries. Therefore, a more effective way to compare speeds is by conducting matrix multiplications using TensorFlow. The second part of our code is designed to accomplish precisely this task. We will perform matrix multiplications of various-sized matrices and explore how the true advantage of GPUs lies in working with large matrices (datasets in general).
In the second part of this tutorial, we will verify the GPU speed advantage over CPU for different matrix sizes. The relative efficiency of the GPU compared to the CPU can vary based on the computational demands of the specific task.
To make sure we start from a common baseline for each matrix multiplication task, we will clear the default graph and release the GPU memory. We will also disable eager execution in TensorFlow for the matrix multiplication task. Eager execution is a mode that allows operations to be executed immediately as they are called, instead of being explicitly executed within a session; it is enabled by default in TensorFlow 2.x. By disabling eager execution, operations are added to a computation graph, and the graph is executed within a session.
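A minimal sketch of this graph-mode benchmark, assuming TensorFlow 2.x with its v1 compatibility API (the matrix sizes and device strings below are illustrative, not the exact code from the video):

import time
import tensorflow as tf

# Run in graph mode: ops are added to a graph and executed in a session.
tf.compat.v1.disable_eager_execution()

def benchmark_matmul(n, device):
    tf.compat.v1.reset_default_graph()      # fresh graph: common baseline per run
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        c = tf.linalg.matmul(a, b)
    with tf.compat.v1.Session() as sess:
        start = time.perf_counter()
        sess.run(c)                          # timing also includes input generation
        return time.perf_counter() - start

for n in (256, 1024, 4096):                  # assumed sizes
    cpu_t = benchmark_matmul(n, "/CPU:0")
    gpu_t = benchmark_matmul(n, "/GPU:0")    # first GPU run pays one-time init cost
    print(f"n={n}: CPU {cpu_t:.4f}s, GPU {gpu_t:.4f}s, speedup {cpu_t / gpu_t:.1f}x")

Expect the speedup to grow with n: for small matrices, kernel-launch and data-transfer overhead can make the GPU look no better than the CPU.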
Finally, forget FLOPS: it's all about memory bandwidth!
Memory bandwidth is a measure of how quickly data can be transferred between the processor (CPU or GPU) and the memory.
High memory bandwidth is crucial for tasks that involve frequent access to large datasets (e.g., deep learning training).
Memory bandwidth becomes particularly important when dealing with large matrices, as transferring data between the processor and memory efficiently can significantly impact overall performance.
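As a concrete back-of-the-envelope sketch (assuming float64 DAXPY: read X, read Y, write the result, 8 bytes each), the same timing can be turned into an effective-bandwidth estimate:

import time
import numpy as np

N = 100_000_000                 # assumed array size, as above
A = 2.5
X = np.random.rand(N)
Y = np.random.rand(N)

start = time.perf_counter()
Z = A * X + Y
elapsed = time.perf_counter() - start

# DAXPY moves ~24 bytes per element but performs only 2 FLOPs per element,
# so its arithmetic intensity is very low: the operation is memory-bound,
# and the measured "FLOPS" mostly reflects memory bandwidth.
bytes_moved = 3 * 8 * N
print(f"Effective bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
print(f"Arithmetic intensity: {2 * N / bytes_moved:.3f} FLOPs/byte")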
Code used in this video is available here: github.com/bns...
Original title: Why GPUs Outpace CPUs? (tips tricks 56)

Science

Published: 14 Oct 2024

Comments: 17
@aaalexlit · 8 months ago
Awesome as always, thank you! Any chance of a follow-up that includes TPUs?
@vitor-ce2ql · 8 months ago
Hello, I recently discovered your channel and I think it's very good: you are very kind and you explain things well. Congratulations, I want your channel to grow. If my English is bad, sorry, I'm Brazilian.
@alihajikaram8004 · 8 months ago
Hi, I find your channel very informative, and thanks for your great educational videos. Would you make a video about using Conv1D on time series? Could we use it for feature extraction?
@Ayzal.Y_Liebe · 4 months ago
Sir, what type of laptop should I get for deep learning? What do you recommend?
@khangvutien2538 · 7 months ago
At 7:05, I see 256 Tensor cores in the spec sheet. Are they the same tensor processing as in a TPU? Maybe you can also explain TPUs? Note that I'm just starting to watch; maybe you will explain later in the video?
@DigitalSreeni · 7 months ago
The tensor cores in GPUs and TPUs both involve tensor processing, but they are different technologies designed for different purposes. GPUs are more general-purpose and versatile, suitable for a range of tasks like gaming, graphics rendering, and parallel computing workloads. TPUs are purpose-built for machine learning and are highly optimized for tensor operations.
@hamidgholami2683 · 8 months ago
Hi sir, hope you're doing well. May I ask you to make some videos on instance segmentation? I mean good explanations and also some projects based on that. I will be happy if you respond.
@scrambledeggsandcrispybaco2070 · 8 months ago
Hi DigitalSreeni, I have been using your tutorials as a guideline for segmentation using traditional machine learning. Apeer has changed a lot since your videos were made. When I export the file, it gives masks for the different classes separately. What can I do? Thank you for all your knowledge, you are a life saver.
@msaoc22 · 8 months ago
Thank you for the amazing video and the time you spend on us =)
@anshagarwal9826 · 7 months ago
@DigitalSreeni Hi, can you explain why you divide the array size by the time to calculate FLOPS? How does that give the number of floating-point operations per second? What I understood from your calculation is that you are using the time it takes to build the newly calculated array as an estimate of FLOPS.
@DigitalSreeni · 7 months ago
The calculation of FLOPS in my code is based on the time taken to perform a specific operation (e.g., DAXPY) on arrays of a given size. The rationale behind this calculation is that it estimates the rate at which floating-point operations are executed per second. If you consider the DAXPY operation (A * X + Y), each element in the arrays X and Y undergoes a multiplication and an addition, which are floating-point operations. So the total number of floating-point operations is proportional to the array size. It provides a rough measure of the performance in terms of floating-point operations per second. In reality, the actual number depends on the type of operations and of course the underlying hardware.
@anshagarwal9826 · 7 months ago
Thanks @DigitalSreeni, much appreciated 👍
@zainulabideen_1 · 8 months ago
Found amazing information, thanks ❤❤❤
@DigitalSreeni · 8 months ago
Glad it was helpful!
@vidyasvidhyalaya · 8 months ago
Sir, please upload a separate video on converting "195 - Image classification using XGBoost and VGG16 imagenet as feature extractor" into a local web application. Please don't skip my comment; I'm awaiting the video.
@tektronix475 · 8 months ago
I got about a 5,000x speedup on the T4 GPU setup at a matrix size of 10000, which is disheartening and eye-popping at the same time.
@fergalhennessy775 · 29 days ago
w rizz