Erin O.

asked • 10/27/23

PyTorch torch.distributed.launch vs. torchrun

I found this GitHub repo, https://github.com/hbzju/pico, which trains a PyTorch model to identify mislabeled data in a dataset. I don't have much experience with torch and am trying to modify the repo's code to train the model on my own dataset. Running this command:

python3 -m torch.distributed.launch --nproc_per_node 2 train.py \
 --exp-dir experiment/ \
 --dataset catalyst800 \
 --num-class 2

produces the following error:


[2023-10-26 12:37:23,438] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 66047) of binary: /usr/local/bin/python3
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
 time   : 2023-10-26_12:37:23
 host   : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
 rank   : 1 (local_rank: 1)
 exitcode : 2 (pid: 66048)
 error_file: <N/A>
 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
 time   : 2023-10-26_12:37:23
 host   : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
 rank   : 0 (local_rank: 0)
 exitcode : 2 (pid: 66047)
 error_file: <N/A>
 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


Obviously the local_rank arg is an issue. I tried adding this to the argument parser:

parser.add_argument("--local_rank", type=int, default=0)

but I'm still getting that error. I tried to implement torchrun but haven't figured that out.
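A note on the error above: exit code 2 is the status argparse uses when it rejects the command line, so one plausible cause (not confirmed for this repo) is an argument the script does not recognize or a value outside an option's allowed choices. In PyTorch 2.x, torch.distributed.launch passes the rank as --local-rank (with a hyphen), which a parser that only declares --local_rank will reject; torchrun passes no such flag and sets the LOCAL_RANK environment variable instead. The sketch below accepts all three forms. The parse_args helper and the three training options are hypothetical stand-ins copied from the command in the question, not the repo's actual parser.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical parser that tolerates both launchers."""
    parser = argparse.ArgumentParser()
    # Assumed subset of train.py's options, taken from the command in the question.
    parser.add_argument("--exp-dir", type=str, default="experiment/")
    parser.add_argument("--dataset", type=str, default="catalyst800")
    parser.add_argument("--num-class", type=int, default=2)
    # Accept both spellings of the flag. If neither is given (as with torchrun),
    # fall back to the LOCAL_RANK environment variable, then to 0.
    parser.add_argument(
        "--local-rank", "--local_rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"local_rank={args.local_rank} dataset={args.dataset}")
```

With a parser like this, the same script should also start under torchrun, for example: torchrun --nproc_per_node 2 train.py --exp-dir experiment/ --dataset catalyst800 --num-class 2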


Besides these issues, I suspect my dataset is not formatted to be processed as a torch dataset?

import pandas as pd
import torch

def load_catalyst800(filepath='vectorized_catalyst_data.csv', partial_rate=None):
    """Loads a CSV dataset into a torch tensor.

    Args:
    - filepath: Path to the CSV file.
    - partial_rate: Optional proportion of data points to be labeled as partial.
      If None, all data points will be labeled as complete. (Not used yet.)

    Returns:
    - A float32 tensor of shape (rows, columns) containing the data.
    """
    # Read data in Pandas (the path is now a default instead of overwriting the argument)
    data = pd.read_csv(filepath)
    # Convert to NumPy; float32 assumes every column is numeric
    data_numpy = data.to_numpy(dtype='float32')
    # Convert the NumPy array to a tensor.
    csv_tensor = torch.from_numpy(data_numpy)
    return csv_tensor
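On the dataset question: a single tensor holding the whole CSV is not yet something a DataLoader can split into inputs and targets. One common approach, sketched below, is to separate features from labels and wrap both in a TensorDataset. This is only a sketch under assumptions: make_dataset is a hypothetical helper, the label is assumed to be the last column, and the small table is made-up data standing in for the CSV contents. The repo's own training loop may expect a different structure (for example, partial-label matrices), so treat this as a starting point.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_dataset(table: np.ndarray) -> TensorDataset:
    """Wrap a numeric table as (features, label) pairs.

    Assumption: every column is numeric and the LAST column is the class label.
    """
    features = torch.from_numpy(table[:, :-1].astype(np.float32))
    labels = torch.from_numpy(table[:, -1].astype(np.int64))
    return TensorDataset(features, labels)


# Made-up stand-in for the CSV contents: 4 rows, 3 features, 1 label column.
table = np.array([
    [0.1, 0.2, 0.3, 0],
    [0.4, 0.5, 0.6, 1],
    [0.7, 0.8, 0.9, 0],
    [1.0, 1.1, 1.2, 1],
])

dataset = make_dataset(table)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# Each batch is a (features, labels) pair.
x, y = next(iter(loader))
print(x.shape, y.tolist())
```

For multi-process training, the DataLoader would typically also be given a torch.utils.data.distributed.DistributedSampler so that each process sees a different shard of the data.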


I'd love to connect with someone with experience running distributed jobs using PyTorch! I'm not really sure how to set up this library correctly for this purpose. Thanks!

1 Expert Answer

Erin O.

Absolutely! Let me take a look at your suggestions and I would be happy to leave a review:)

10/27/23

Erin O.

OK, so I think the problem was ultimately that I was trying to run on NVIDIA GPUs, which requires installing the NVIDIA CUDA toolkit. My research indicates that this toolkit is no longer compatible with Apple computers? I do still need to try running on CPUs. I’m not seeing where I can leave you a review. I might only be able to do that after a lesson? I’ve only seen that prompt after a lesson with a tutor.

11/05/23
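For the CPU attempt mentioned in the comment above, a minimal sketch of choosing the device and the torch.distributed backend at runtime follows. pick_device_and_backend is a hypothetical helper, not part of the repo; it relies on the fact that the "nccl" backend needs CUDA GPUs while "gloo" works on CPU.

```python
import torch


def pick_device_and_backend():
    """Return (device, backend): CUDA with nccl if available, else CPU with gloo."""
    if torch.cuda.is_available():
        return torch.device("cuda"), "nccl"
    return torch.device("cpu"), "gloo"


device, backend = pick_device_and_backend()
print(f"device={device} backend={backend}")
```

The returned backend string would then be passed to torch.distributed.init_process_group(backend=backend), and any hard-coded .cuda() calls in the script would need to become .to(device). Whether the repo hard-codes either of these is an assumption to check against its source.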
