ML-Notes

Hardware Agnostic Code

This will run the model on whatever hardware your system has available to it, defaulting to CPU if neither Apple Silicon (mps) nor NVIDIA CUDA (cuda) is found on the system. AMD ROCm (Radeon Open Compute) builds of PyTorch report their GPUs through the cuda backend, so they are covered by the same check.

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

If you hand tensors to NumPy (for example when plotting), you may run into device issues, as NumPy only does its calculations on the CPU. To fix this, move the tensor to the CPU with .cpu(), e.g.:

tensor_var.cpu()
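
For instance, a tensor produced on the GPU has to come back to the CPU (and usually be detached from the autograd graph) before NumPy or matplotlib can touch it. A small sketch with placeholder data:

import torch
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"
y_preds = torch.rand(50, device=device, requires_grad=True)  # stand-in for model predictions

plt.plot(y_preds.detach().cpu().numpy())  # detach from the graph, move to CPU, then convert to NumPy
plt.show()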


When going from NumPy arrays to tensors, you will have to use torch.from_numpy(x); this is usually needed when importing data from scikit-learn, e.g.:

# Import necessary libraries
import sklearn
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
import torch  # Ensure PyTorch is imported for tensor operations

# Define the number of samples
n_samples = 1000

# Generate a synthetic dataset of circles
X, y = make_circles(n_samples=n_samples,   # Total number of samples
                    noise=0.03,           # Add slight noise to the data
                    random_state=42)      # Ensures reproducibility

# Convert the dataset to PyTorch tensors
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,                # Features and labels
    test_size=0.2,       # 20% of data will be used for testing
    random_state=42      # Ensures reproducibility of the split
)

# Output shapes for verification
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Testing labels shape: {y_test.shape}")

Creating Valid Data

When creating sample data, you need at least a 2D tensor/matrix, because machine learning models require a feature dimension, i.e. shape (n, 1), where n is the number of samples and 1 is the number of features per sample.

As an example:

For a house: (Sample: n, Features: 3)

  • Sample: a specific house.
  • Features:
    1. Size: 1500 square feet.
    2. Bedrooms: 3.
    3. Location Index: 2 (e.g., urban area).

This is usually done by calling unsqueeze(dim=1) on the range, i.e.:

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)

torch.arange(0, 1, 0.02) creates a 1D tensor of 50 samples with no feature dimension; unsqueeze(dim=1) adds a dimension at index 1, giving the tensor/matrix the shape (50, 1).
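
A quick shape check makes the effect of unsqueeze visible:

import torch

X = torch.arange(0, 1, 0.02)  # 1D tensor of 50 samples, shape (50,)
print(X.shape)                # torch.Size([50])

X = X.unsqueeze(dim=1)        # add a feature dimension at index 1
print(X.shape)                # torch.Size([50, 1])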

Setting the Algorithm

The Linear Regression algorithm is used for regression problems, for instance fitting a line to house price vs. square footage. It is a simplified representation of the linear relationship between X (the feature or input) and y (the target or output), expressed by the linear regression equation:

y = weight * X + bias
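
As a toy example, a known weight and bias can be used to generate data that a model will later try to recover (the 0.7 and 0.3 values are arbitrary choices for illustration):

import torch

weight = 0.7                                   # "true" slope the model should learn
bias = 0.3                                     # "true" intercept the model should learn

X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)  # 50 samples with one feature each
y = weight * X + bias                          # targets generated from the linear regression equation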

Creating Training/Testing Split

Normally, when training, split the data 80/20: 80% for training and 20% for testing.

Something like:

train_split = int(0.8 * len(X)) # gets 80% of the current length of the dataset; needs to be an int for indexing
X_train, y_train = X[:train_split], y[:train_split] # [:train_split] takes from the start of the index up to the 80% mark
X_test, y_test = X[train_split:], y[train_split:] # [train_split:] takes from the 80% mark to the end, i.e. the remaining 20%

Creating/Inheriting Model class

When creating a model, you will need to import nn from torch, and in particular nn.Module.

Usually something like:

import torch
from torch import nn

You will have to subclass nn.Module in a custom class, using it as the superclass.

class LinearRegressionModel(nn.Module): # nn.Module is the base class for all neural network modules in PyTorch; the custom class inherits from it
    def __init__(self): # the constructor, a way to initialize the class's attributes
        super().__init__() # this is how we inherit from nn.Module; ensures the parent class's constructor initializes properly
        self.weights = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float)) # creates a model parameter, initialized randomly
        self.bias = nn.Parameter(torch.randn(1, requires_grad=True, dtype=torch.float)) # creates a model parameter, initialized randomly

    def forward(self, x: torch.Tensor) -> torch.Tensor: # REQUIRED: all nn.Module subclasses must override the forward method of nn.Module
        return self.weights * x + self.bias # the Linear Regression equation

Inside it you will need to initialize the weights and biases, usually to random values or zero, and define the forward method. The forward method is required.

After that is created, you will need to initialize the loss function and the optimizer (telling the optimizer which parameters it is optimizing).

Then, in the training loop, you will need to set the model to train mode, do a forward pass, calculate the loss, zero the accumulated gradients, do the backward pass, and then take an optimizer step.

Once this is done, you can test the model: set it to eval mode, do a forward pass on the test data, calculate the loss, and see the results on previously unseen data.
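
A minimal evaluation sketch, assuming a trained model, the loss_fn, and the X_test/y_test split defined elsewhere in these notes:

model.eval()                                # switch layers such as dropout/batch norm to evaluation behaviour
with torch.inference_mode():                # disable gradient tracking for faster, lighter inference
    test_preds = model(X_test)              # forward pass on previously unseen data
    test_loss = loss_fn(test_preds, y_test) # how far off the model is on the test set
print(f"Test loss: {test_loss}")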

Loss Functions

For regression, you will want to use MAE (nn.L1Loss()) or MSE (nn.MSELoss()).

For classification, you might want to use binary cross entropy (nn.BCELoss() or nn.BCEWithLogitsLoss(), the latter recommended), or categorical cross entropy (nn.CrossEntropyLoss()), which can also be used for multi-class classification.

BCELoss does not include the sigmoid activation function, while BCEWithLogitsLoss combines the sigmoid with the loss. BCEWithLogitsLoss is more numerically stable, because it does both in one step instead of first applying a sigmoid in a separate layer and then BCELoss.
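
To see the relationship, a small sketch with made-up logits and labels shows that the two compute the same value, with BCEWithLogitsLoss doing the sigmoid internally:

import torch
from torch import nn

logits = torch.tensor([2.0, -1.0, 0.5])    # raw model outputs, made up for illustration
targets = torch.tensor([1.0, 0.0, 1.0])    # ground-truth binary labels

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid applied internally
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)  # sigmoid applied by hand

print(loss_with_logits, loss_manual)       # (almost) identical values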

When a loss function is used, it must first be assigned, i.e.:

loss_fn = nn.CrossEntropyLoss()

loss = loss_fn(logits, y_train)  # the model's raw logits vs. the training labels

When doing the backward propagation, use:

loss.backward()

Optimizer

Use the Stochastic Gradient Descent (SGD) optimizer for Classification, Regression, and others: torch.optim.SGD()

Use the Adam optimizer (the AdamW variant is recommended) for Classification, Regression, and others: torch.optim.AdamW()

When an optimizer is used, it must first be assigned, e.g.:

optimizer = torch.optim.AdamW(params=model_4.parameters(),lr=0.01)

When doing the optimization step, call:

optimizer.step()

Zero Grad

When doing a looped forward and backward propagation, you have to zero out the accumulated gradients by using:

optimizer.zero_grad()

Logits

In a machine learning model, logits are the raw, unprocessed outputs produced directly after a forward pass through the model. These logits are unnormalized values and require further processing to interpret as probabilities or predictions (labels).

Using BCEWithLogitsLoss

For binary classification tasks, it is common to use the BCEWithLogitsLoss loss function, which works directly with logits. This eliminates the need to manually apply a sigmoid activation during training, as the loss function internally computes it. However, for inference or evaluation, you must explicitly process the logits to obtain meaningful predictions.

Steps to Convert Logits to Predictions

1) Convert Logits to Probabilities: Use the sigmoid activation function to map logits to probabilities in the range [0,1]. This is done as:

probabilities = torch.sigmoid(logits)

2) Convert Probabilities to Labels: Apply a threshold (e.g., 0.5) to determine class labels. This can be done using rounding:

labels = torch.round(probabilities)

Or, combine the steps:

labels = torch.round(torch.sigmoid(logits))

Example Workflow

1) Perform a forward pass to obtain raw logits:

logits = model(inputs)

2) Apply the sigmoid activation to obtain probabilities:

probabilities = torch.sigmoid(logits)

3) Round probabilities to get binary class labels (0 or 1):

labels = torch.round(probabilities)

General Notes on Activation Functions

  • For binary classification, use the sigmoid function, as described above.
  • For multi-class classification, use the softmax function to convert logits into probabilities across multiple classes (see the sketch below).
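
For the multi-class case, the same logits-to-labels conversion can be sketched with softmax over the class dimension and argmax to pick the most probable class (the shapes here are only illustrative):

import torch

logits = torch.randn(8, 3)                    # e.g. a batch of 8 samples and 3 classes, made up here
probabilities = torch.softmax(logits, dim=1)  # per-sample probabilities that sum to 1 across classes
labels = torch.argmax(probabilities, dim=1)   # index of the most probable class for each sample
print(labels)                                 # tensor of class indices in the range [0, 2]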

Types of Learning and their Optimal Algorithms

Supervised Learning

1. Linear Regression

  • Use Case: Modeling a continuous outcome based on one or more predictor variables, especially when the relationship is assumed to be roughly linear.
  • Optimizer: torch.optim.SGD (Stochastic Gradient Descent)
  • Loss Function: torch.nn.MSELoss (Mean Squared Error); a minimal setup is sketched below.
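
A minimal linear-regression setup along these lines, assuming a single input feature (the learning rate is an arbitrary choice):

import torch
from torch import nn

model = nn.Linear(in_features=1, out_features=1)                 # one weight and one bias, i.e. y = weight * x + bias
loss_fn = nn.MSELoss()                                           # Mean Squared Error
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)  # Stochastic Gradient Descent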

2. Logistic Regression

  • Use Case: Predicting binary or categorical outcomes, such as determining whether an email is spam or not.
  • Optimizer: torch.optim.SGD or torch.optim.AdamW
  • Loss Function:
    • torch.nn.BCELoss
      • Expects probabilities, so the sigmoid (and any rounding) has to be applied manually. (Not recommended for modern models)
    • torch.nn.BCEWithLogitsLoss (Recommended for binary classification; see the sketch below)
      • Internally applies a sigmoid before computing binary cross-entropy, offering better numerical stability.
    • torch.nn.CrossEntropyLoss (for multi-class classification)
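
A minimal logistic-regression setup along these lines, assuming two input features and made-up data (the shapes and learning rate are illustrative):

import torch
from torch import nn

model = nn.Linear(in_features=2, out_features=1)                   # one logit per sample
loss_fn = nn.BCEWithLogitsLoss()                                   # expects raw logits, applies sigmoid internally
optimizer = torch.optim.AdamW(params=model.parameters(), lr=0.01)

X = torch.rand(8, 2)                                               # 8 samples, 2 features (made up)
y = torch.randint(0, 2, (8,)).float()                              # binary labels as floats

logits = model(X).squeeze(dim=1)                                   # shape (8,), matching the labels
loss = loss_fn(logits, y)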

3. K-Nearest Neighbors (KNN)

Use Case: Classification or regression tasks where the data is well-clustered or where instance-based (distance-based) reasoning is effective.

Note: KNN is a lazy learning algorithm and does not involve training in the conventional sense. There’s no standard optimizer or loss function because the model “trains” by storing data points and performing distance comparisons at inference time.

4. Support Vector Machine (SVM)

  • Use Case: Binary or multi-class classification where a maximum margin separating hyperplane is effective; can also be used for regression (SVR).
  • Optimizer: torch.optim.SGD
  • Loss Function: torch.nn.HingeEmbeddingLoss (implements the hinge loss typical for SVMs)

5. Naive Bayes

Use Case: Classification tasks where features are conditionally independent (e.g., text classification, spam detection), leveraging probabilities derived from training data.

Note: Naive Bayes is probabilistic, based on Bayes' theorem. It does not require a traditional optimizer or loss function; instead, it calculates class probabilities from training data distributions.

6. Decision Trees

Use Case: Classification or regression tasks where easily interpretable rules are desired, often used for quick, rule-based decisions.

Note: Decision trees use heuristic-based splitting criteria (e.g., Gini impurity, entropy) rather than explicit optimizers or loss functions.

7. Neural Networks

  • Use Case: Extremely flexible models for a wide range of tasks—classification, regression, computer vision, NLP, and more, depending on architecture and configuration.
  • Optimizer:
    • Common choices include torch.optim.AdamW or torch.optim.SGD.
  • Loss Function (task-dependent):
    • Regression: torch.nn.MSELoss
    • Binary Classification: torch.nn.BCEWithLogitsLoss
    • Multi-class Classification: torch.nn.CrossEntropyLoss (see the sketch below)
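
A short example of how CrossEntropyLoss is called, since it expects raw logits of shape (batch, num_classes) and integer class indices as targets (the shapes here are made up):

import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(8, 3)           # batch of 8 samples, 3 classes, raw model outputs
targets = torch.randint(0, 3, (8,))  # ground-truth class indices in [0, 2]

loss = loss_fn(logits, targets)      # softmax is handled internally by the loss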

Training Loop

“Timid Frogs Leap Gracefully Backwards Swiftly”

1. Timid - Set the model to train mode (model.train()).
2. Frogs - Perform the forward pass.
3. Leap - Compute the loss.
4. Gracefully - Zero the gradients (optimizer.zero_grad()).
5. Backwards - Execute backward propagation (loss.backward()).
6. Swiftly - Take an optimization step (optimizer.step()).
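
Putting the mnemonic together, a minimal training loop sketch, assuming the model, loss_fn, optimizer, and X_train/y_train defined earlier in these notes (the epoch count is arbitrary):

epochs = 100                          # arbitrary number of passes over the training data

for epoch in range(epochs):
    model.train()                     # 1. Timid - train mode
    y_preds = model(X_train)          # 2. Frogs - forward pass
    loss = loss_fn(y_preds, y_train)  # 3. Leap - compute the loss
    optimizer.zero_grad()             # 4. Gracefully - zero accumulated gradients
    loss.backward()                   # 5. Backwards - backward propagation
    optimizer.step()                  # 6. Swiftly - optimization step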