Something seems seriously off with your fast matmul implementation: it's three orders of magnitude slower than the built-in method (12.5 ms vs 8.82 us). You probably have some host-device copying going on.
The matmul example shown is the one from the numba documentation, so I don't think it's wrong. It's (relatively) slow because matrix multiplication is so common that the available implementations are insanely optimized. You won't write a matrix multiplication with numba that's faster than cupy's. But if you have something custom to do, a custom kernel can be faster than a combination of cupy operations.
Thank you for this tutorial, it has been very helpful! But since it is only an introduction could anyone tell me what I should watch or read next on this topic? Thanks in advance for the advice!
GPUs aren't general purpose... sigh... They are really good at one specific thing: executing the same operation on many banks of data. It just happens that graphics and machine learning have a similar type of need.
There is a Python OpenCL package (pyopencl). With imports and context setup included:

import numpy
import pyopencl
import pyopencl.array
from pyopencl.reduction import ReductionKernel

ctx = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(ctx)

a = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
b = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
                       reduce_expr="a+b", map_expr="x[i]*y[i]",
                       arguments="__global float *x, __global float *y")
my_dot_prod = krnl(a, b).get()

🙂 The benefit is that it works on ALL GPUs, not only Nvidia's (it works on Intel integrated GPUs and on AMD GPUs).
I've thought about it but it's a lot of work to make and edit a silly video like this, and at the moment I really don't have the time. I don't get anything for making these videos.
Wait. At 12:10, the narrator says the timeit magic function reports a duration of 5 ms, but the number shown is only 0.01 ms away from 6 ms. The number is far from 5 compared to 6. It should be 6 ms if he's rounding, not 5 ms. He's truncating the decimals to arrive at an integer.
I have been looking into GPU programming using numba and Python for a while; this seems to be the best tutorial I have been able to find so far. Thank you!
I suggest using "conda install conda-forge::package" over "conda install -c conda-forge package". These mean different things:
1. Install this one package from conda-forge; install dependencies and other packages according to my channel settings.
2. Install this package and look for ALL packages on conda-forge first.
The second form makes conda-forge the highest-priority channel, so if you normally use "defaults" you will see many packages getting replaced by the same version, and again if you run another install command on the same environment without the "-c" (the same packages will get reinstalled from defaults).
In general, the packages on conda-forge aim to be interoperable, so ideally all the packages in your environment should come from conda-forge. There is no guarantee that packages will work together when they come from different channels, for example C and Fortran packages compiled with different compilers. Conda-forge standardizes these details and ensures that dependencies come from conda-forge during the build.
@@nickcorn93 If all of the packages are coming from conda-forge, then there is no need to specify -c, --channel. If one is using -c, then they are most likely using defaults and cherry picking a few packages from conda-forge. In any case, those two invocations look very similar but act differently when conda is using strict channel priority (the default for 99% of people). I agree that it's better to be all-in (or all out) when using conda-forge; the packages are tested very well and built using the same build-chains and configurations. It's when mixing channels that one can run into inscrutable problems. The fact that defaults and conda-forge work so well together most of the time makes the tiny inconsistencies more surprising for most users.
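For anyone wanting to go all-in as described above, a minimal sketch of a ~/.condarc that pins conda-forge as the only channel with strict priority (the keys shown are standard conda configuration options; adjust to your setup):

```yaml
# ~/.condarc: use conda-forge exclusively, with strict channel priority
channels:
  - conda-forge
channel_priority: strict
```

With this in place, plain "conda install package" resolves everything from conda-forge and the channel-mixing inconsistencies described above don't arise.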
Thank you so much. Probably the best introduction to CUDA with Python. The example you use, while very basic, touches on the usage of blocks, which is usually omitted in other introduction-level tutorials. Great stuff! Hope you return with some more videos. I have subscribed!
Good job. A mention of using 'skip: true  # [win]' under 'build:' (or similar) to only build for Linux would be helpful (you mentioned why it would be helpful, but not how to do it).
Best to have a look at this example github.com/conda-forge/exitwavereconstruction-feedstock/blob/main/recipe/meta.yaml and check out the docs conda-forge.org/docs/maintainer/adding_pkgs.html#build
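For reference, a minimal sketch of what that skip looks like in a recipe's meta.yaml (the '# [win]' part is a conda-build selector comment, so the line only applies on Windows; the build number is illustrative):

```yaml
build:
  number: 0
  skip: true  # [win]
```

The linked docs list the other available selectors (osx, linux, py version, etc.).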
What is your OS? You may be having issues if you are using Windows and pip. It's easiest to install cupy in a conda virtual environment, as that will also install the CUDA toolkit.
@@nickcorn93 Sorry for bothering you, the problem was that I had not installed the CUDA Toolkit. Seriously, I hate people who don't watch the full video closely and then ask stupid questions... and now I'm one of them :D. Thanks a lot for this tutorial; in 2 months I will try to write my own GPU operator for my program, and it will be interesting to see if it's faster than the CPU version. (Btw, using normal VS Code with a Python 3.10 env on Win 11, so far so good. Although I have some code output delay problem when using OpenCV, for some strange reason.)