
Training Neural Network using multiple GPUs on PARAM SHAKTI supercomputer | SLURM | Batch Job

Pankaj Kasar
117 subscribers · 6K views

Published: 28 Oct 2024

Comments: 29
@vsksam · 4 months ago
Thank You Bro...
@s.m.ruhulkabirhowlader7578 · 8 months ago
Thank you for this video.
@xiangli1133 · 1 year ago
Thanks! So insightful!
@wonder_hd · 2 years ago
Sir, I don't understand Hindi, but I can still follow what you are doing.
@NicoleRosi · 1 year ago
Thank you for this video!
@devashishkprasad · 2 years ago
Thank you for this very helpful video!!
@himanshu1689 · 1 year ago
Hi Pankaj! Nice video. It would be really helpful if you could share some sample code for single-GPU and multi-GPU training.
@prashantmore7912 · 6 months ago
Sir, how do I create a module and upload it?
@SubrataBarman-h1g · 11 months ago
I am using WinSCP to access the PARAM Shakti service provided by IIT KGP, but I am facing an issue while transferring data. Although I am using high-speed internet, the data transfer speed is very low (it hardly goes up to 100 kbps). Please let me know how to solve this, and is there any alternative software for transferring data?
@pankajkasar9512 · 11 months ago
Use MobaXterm.
@barman5186 · 11 months ago
@@pankajkasar9512 Thank you very much, Sir, for your prompt reply. I have tried MobaXterm but am not able to connect. Could you please explain how to log in with the hostname using MobaXterm, or how to configure it for that?
@anilkumarsharma8901 · 2 years ago
Get your subscribers 👌 to use the supercomputer 💻, and then the world will follow 👌😇😇
@disinlungkamei2869 · 1 year ago
Hello sir, if we were to run on 10 GPU nodes, then how would we write our SLURM script? Thank you, sir.
@pankajkasar9512 · 1 year ago
On PARAM Shakti there are 11 GPU nodes with 2 GPUs each, which means that for 10 GPUs you need to reserve 5 nodes, so set Nodes=5. But why do you need 10 GPUs? You would also have to write your script for distributed training accordingly; otherwise it is not possible to use 10 GPUs.
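
For reference, a minimal sketch of what such a multi-node batch script could look like for 5 nodes with 2 GPUs each (the partition name, module name and script name are assumptions, not taken from the video; the training code itself must use a distributed strategy such as tf.distribute.MultiWorkerMirroredStrategy and be configured for multi-worker execution):

    #!/bin/bash
    #SBATCH --job-name=multi_gpu_train
    #SBATCH --nodes=5                  # 5 nodes x 2 GPUs = 10 GPUs
    #SBATCH --ntasks-per-node=1        # one training process per node
    #SBATCH --gres=gpu:2               # request both GPUs on each node
    #SBATCH --time=04:00:00
    #SBATCH --partition=gpu            # assumed partition name; check on the cluster
    #SBATCH --output=train_%j.log

    # environment setup is site-specific; adjust the module/conda setup to PARAM Shakti
    module load DL-Conda-Py3.7

    # srun starts one copy of the distributed training script on every allocated node
    srun python train_distributed.py
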
@nitinkumarchauhan6559 · 3 years ago
Is this for a sequential job or a parallel job, with one GPU card or two GPU cards?
@pankajkasar9512 · 3 years ago
It is a sequential job using two GPUs on a single node (a single machine).
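
As a rough illustration of how a Keras training script can use both GPUs of one node, here is a minimal sketch with tf.distribute.MirroredStrategy; the tiny model and the commented-out data are placeholders, not the code from the video:

    import tensorflow as tf

    # MirroredStrategy replicates the model across all GPUs visible on this machine
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices:", strategy.num_replicas_in_sync)

    with strategy.scope():
        # placeholder model; any Keras model built inside this scope is mirrored
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    # x_train / y_train are assumed to be loaded elsewhere
    # model.fit(x_train, y_train, epochs=10, batch_size=256)
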
@shobhitgautam7058 · 2 years ago
Sir, how do I create a module? I am getting the error 'no module named tensorflow'.
@pankajkasar9512 · 1 year ago
That means the package is not available; you can install the package with "pip install tensorflow".
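
On a shared cluster you usually cannot install system-wide, so a common approach is a user-level or virtual-environment install; a minimal sketch (the environment path is an assumption):

    # install into your user site-packages (no admin rights needed)
    pip install --user tensorflow

    # or create an isolated environment in your home directory
    python -m venv ~/tf-env
    source ~/tf-env/bin/activate
    pip install tensorflow
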
@IITian1988 · 2 years ago
Hello Pankaj sir, this is Anup Mahato, research scholar from IIT Kharagpur. Sir, I have a problem with running the RegCM model. On PARAM Shakti, RegCM is available and all the libraries and MPI are available, and I have made the script file as well, but I am not getting the input file on the server. From where will I get that file? Please help me. Thanks in advance.
@pankajkasar9512 · 2 years ago
Please state your issue in detail...
@prembabupal1889 · 1 year ago
How do I run MATLAB code using PARAM Shakti?
@gamermoneyfree8193 · 2 years ago
Sir, where do I apply for remote access to PARAM Shakti?
@pankajkasar9512 · 2 years ago
IIT Kharagpur or C-DAC team
@nitinkumarchauhan6559 · 3 years ago
Sir, thanks for this knowledgeable post. When I try to plot the testing and validation curves over epochs, I get this error (on PARAM Shivay). How can I overcome this issue?

    QStandardPaths: XDG_RUNTIME_DIR points to non-existing path '/run/user/5475', please create it with 0700 permissions.
    qt.qpa.screen: QXcbConnection: Could not connect to display localhost:22.0
    Could not connect to any X display.
@pankajkasar9512 · 3 years ago
Why are you displaying such graphs and plots on the supercomputer while training? Don't do that. What I suggest is to use the CSVLogger(csv_path) callback available via "from tensorflow.keras.callbacks import CSVLogger". Create a CSV file and store all the training performance parameters in that file; then you can download it and inspect it manually. Add this callback in model.fit() and store everything in one file, and then you can produce graphs, visualizations, etc. from it.
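
A minimal sketch of the CSVLogger approach described above; the file name and the model/data variables are placeholders:

    from tensorflow.keras.callbacks import CSVLogger

    # each epoch's loss/accuracy (and validation metrics) is appended to this CSV file
    csv_logger = CSVLogger("training_log.csv", append=True)

    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=50,
              callbacks=[csv_logger])

    # afterwards, download training_log.csv and plot the curves locally,
    # instead of opening plot windows on the compute node
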
@nitinkumarchauhan6559 · 3 years ago
@@pankajkasar9512 Ok sir, that works for me. There is one more issue: when I import a model pretrained on ImageNet, I am able to load only VGG16; other models like ResNet, DenseNet etc. are not getting imported, as they show this error:

    Downloading data from github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5
    ---------------------------------------------------------------------------
    gaierror                        Traceback (most recent call last)
    ... (frames in urllib/request.py, http/client.py and socket.py under /home/apps/DL-Conda-Py3.7/lib/python3.7/) ...
    gaierror: [Errno -2] Name or service not known

    During handling of the above exception, another exception occurred:
    URLError                        Traceback (most recent call last)
    ... (frames in keras/utils/data_utils.py get_file and urllib/request.py) ...

    During handling of the above exception, another exception occurred:
    Exception                       Traceback (most recent call last)
    ----> 1 model_densenet = Densenet()
    --> 106 denseNet121 = DenseNet121(weights="imagenet", include_top=False)
    ... (frames in keras/applications/densenet.py and keras/utils/data_utils.py get_file) ...
    Exception: URL fetch failure on github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5: None -- [Errno -2] Name or service not known
@nitinkumarchauhan6559 · 3 years ago
@@pankajkasar9512 Kindly guide me, Sir... I am stuck with the above issue; I am unable to fetch the pretrained models.
@pankajkasar9512 · 3 years ago
@@nitinkumarchauhan6559 Take the URL "github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5", paste it into a browser, download the file manually, and then use the manually downloaded weights: instead of weights="imagenet", use weights="resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5" with the proper file path, and then it will work. Make this change at the pre-trained model, e.g. """ Pre-trained ResNet50 Model """ resnet50 = ResNet50(include_top=False, weights="resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5", input_tensor=inputs). The same applies to VGG16-UNET and all the rest.
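
A minimal sketch of that workaround, assuming the .h5 file has already been downloaded on a machine with internet access and copied to the cluster (the file path is a placeholder):

    import tensorflow as tf
    from tensorflow.keras.applications import ResNet50

    # weights file downloaded manually from the keras-applications GitHub releases
    local_weights = "resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5"

    inputs = tf.keras.Input(shape=(224, 224, 3))
    resnet50 = ResNet50(include_top=False,
                        weights=local_weights,   # local .h5 path instead of "imagenet"
                        input_tensor=inputs)
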
@nitinkumarchauhan6559 · 3 years ago
@@pankajkasar9512 Thank you very much, sir.