PyTorch torch.distributed.launch vs. torchrun

Question

I found this GitHub Repo, https://github.com/hbzju/pico, which trains a pytorch model to identify mislabeled data in a dataset. I don't have much experience with torch and am trying to modify the repo's code to train the model on my dataset. Running this command:

python3 -m torch.distributed.launch --nproc_per_node 2 train.py \
    --exp-dir experiment/ \
    --dataset catalyst800 \
    --num-class 2

produces the following error:

[2023-10-26 12:37:23,438] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 66047) of binary: /usr/local/bin/python3
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-10-26_12:37:23
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 66048)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-26_12:37:23
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 66047)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Obviously the local_rank arg is an issue. I tried to pass:

parser.add_argument("--local_rank", type=int, default=0)

but I'm still getting that error. I tried to implement torchrun but haven't figured that out.

Besides these issues, I suspect my dataset is not formatted to be processed as a torch dataset?

def load_catalyst800(filepath, partial_rate=None):
    """Loads a CSV dataset into a Pandas DataFrame.

    Args:
    - filepath: Path to the CSV file.
    - partial_rate: Optional proportion of data points to be labeled as partial.
    - If None, all data points will be labeled as complete.

    Returns:
    - A Pandas DataFrame containing the data.
    """
    filepath = 'vectorized_catalyst_data.csv'

    # Read data in Pandas
    data = pd.read_csv(filepath)

    # Convert to Numpy
    data_numpy = data.to_numpy()

    # Convert the NumPy array to a tensor.
    csv_tensor = torch.from_numpy(data_numpy)

    return csv_tensor

I'd love to connect with someone with experience running distributed jobs using Pytorch! I'm not really sure how to set up this library correctly for this purpose. Thanks!

Benjamin M. · Accepted Answer

Hi Erin,

You're facing multiple issues here: a problem related to torch.distributed.launch and another concerning data loading and formatting. Let's tackle them one by one.

torch.distributed.launch vs. torchrun

Both torch.distributed.launch and torchrun are used for distributed training, but torchrun is newer and generally simpler to use.

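torch.distributed.launch: The older utility. It passes the local rank to your script as a command-line argument, so the script has to parse it. Since PyTorch 2.0 the launcher passes the dashed form --local-rank instead of --local_rank, so a parser that only defines --local_rank rejects it as an unrecognized argument.

torchrun: The newer launcher. It does not pass a local-rank argument; it sets the LOCAL_RANK environment variable, which the script reads with os.environ["LOCAL_RANK"].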

If you're facing issues with torch.distributed.launch, switching to torchrun might simplify things. The equivalent command would be:
torchrun --nproc_per_node=2 train.py --exp-dir experiment/ --dataset catalyst800 --num-class 2
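If train.py needs to work under either launcher, one option is to read the local rank from whichever source is available. The following is a minimal sketch, not the repo's actual code: the helper name get_local_rank is hypothetical, and it assumes the rest of the script's flags are parsed elsewhere.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Return the local rank from the CLI flag if present, else from LOCAL_RANK."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Accept both spellings: the launcher passes --local-rank since PyTorch 2.0
    # and --local_rank in earlier versions.
    parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                        type=int, default=None)
    # parse_known_args leaves the script's other flags untouched in this sketch.
    args, _ = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun passes no flag; it sets the LOCAL_RANK environment variable.
    return int(environ.get("LOCAL_RANK", 0))
```

Called with no arguments, get_local_rank() reads sys.argv and os.environ, so the same script can be started with torch.distributed.launch, with torchrun, or as a plain single-process run (where it falls back to rank 0).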
Dataset Formatting

Your load_catalyst800 function reads a CSV file into a Pandas DataFrame, converts it to a NumPy array, and finally converts that to a PyTorch tensor. This does produce a tensor, but it is not a PyTorch Dataset. A Dataset returns one sample (typically features plus a label) per index, and a DataLoader wraps it to provide batching and shuffling.

To convert your tensor into a Dataset, you can use the TensorDataset class from PyTorch:

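import pandas as pd
import torch
from torch.utils.data import TensorDataset

def load_catalyst800(filepath):
    # Read the CSV and convert it to a tensor (all columns must be numeric)
    data = pd.read_csv(filepath)
    data_numpy = data.to_numpy()
    csv_tensor = torch.from_numpy(data_numpy)

    # Wrap the tensor so that each index returns one row
    dataset = TensorDataset(csv_tensor)
    return dataset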
After this, you can create a DataLoader:

from torch.utils.data import DataLoader

dataset = load_catalyst800('vectorized_catalyst_data.csv')
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
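For a distributed classification run, the features and labels usually need to be separate tensors, and each process should see its own shard of the data. Here is a rough sketch under stated assumptions: the helper name make_loader is hypothetical, the last column is assumed to hold integer class labels, and rank and world_size are passed in explicitly so the snippet runs without an initialized process group. Whether this batch layout matches what the repo's training loop expects is not something the question confirms.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(array, rank, world_size, batch_size=32):
    """Build a per-rank DataLoader from a 2-D numeric array.

    Assumption: the last column holds integer class labels and every
    other column is a feature.
    """
    features = torch.as_tensor(array[:, :-1], dtype=torch.float32)
    labels = torch.as_tensor(array[:, -1], dtype=torch.long)
    dataset = TensorDataset(features, labels)
    # The sampler gives each rank a disjoint slice of the indices.
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    # shuffle is left unset on the DataLoader because the sampler handles it.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

In a real training loop, calling sampler.set_epoch(epoch) at the start of each epoch (reachable here as loader.sampler.set_epoch) makes the shuffle order change between epochs.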
Debugging Steps

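Error Messages: The log reports exit code 2 for both ranks. That is the code argparse uses when it rejects the command line, so train.py is most likely being handed an argument it does not define. Ensure that your train.py script accepts the local-rank argument in the form your launcher passes it (--local-rank since PyTorch 2.0, --local_rank before that).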

Data Format: Verify if your data is formatted correctly for PyTorch processing. Is it a classification problem? Are labels included in your data tensor?

Library Compatibility: Make sure that the PyTorch version you're using is compatible with the GitHub repo you're trying to modify. Library inconsistencies often lead to unexpected errors.
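One way to check whether the exit code 2 in the log comes from argument parsing is to reproduce it outside the launcher. The sketch below uses only the standard library; the parser is a hypothetical stand-in for a script that defines just the underscore form, not the repo's real parser.

```python
import argparse


def parse_exit_code(argv):
    """Return 0 if argv parses, else the exit code argparse would exit with."""
    parser = argparse.ArgumentParser(prog="train.py")
    # Hypothetical stand-in: the script only knows the underscore spelling.
    parser.add_argument("--local_rank", type=int, default=0)
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        # argparse prints a usage message to stderr and exits on bad input.
        return exc.code
    return 0


print(parse_exit_code(["--local_rank=0"]))
print(parse_exit_code(["--local-rank=0"]))
```

The first call parses cleanly and returns 0; the second is rejected as an unrecognized argument and returns 2, the same exit code shown in the log. When this happens under the launcher, argparse's usage message is printed to stderr above the ChildFailedError summary, so it is worth scrolling up to look for it.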

If you found this answer useful, I would sincerely appreciate a positive review as I am new on Wyzant and reviews are incredibly helpful for me.

Thank you,
Benjamin M.
