TF-Notes

Shallow Neural Networks vs. Deep Neural Networks

Shallow Neural Networks:

  • Consist of one hidden layer (occasionally two, but rarely more).
  • Typically take inputs as feature vectors, where preprocessing transforms raw data into structured inputs.

Deep Neural Networks:

  • Consist of three or more hidden layers.
  • Handle larger numbers of neurons per layer depending on model design.
  • Can process raw, high-dimensional data such as images, text, and audio directly, often using specialized architectures like Convolutional Neural Networks (CNNs) for images or Recurrent Neural Networks (RNNs) for sequences.
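As a rough illustration of the contrast, here is a minimal Keras sketch; the input width and layer sizes are arbitrary placeholders, not values from these notes.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

# Shallow network: a single hidden layer over an engineered feature vector
shallow = Sequential([
    Input(shape=(20,)),             # 20 preprocessed features (assumed)
    Dense(16, activation='relu'),   # the one hidden layer
    Dense(1, activation='sigmoid')
])

# Deep network: three or more hidden layers
deep = Sequential([
    Input(shape=(20,)),
    Dense(64, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])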

Why deep learning took off: Advancements in the field

  • ReLU Activation Function:
    • Mitigates the vanishing gradient problem, enabling the training of much deeper networks.
  • Availability of More Data:
    • Deep learning benefits from large datasets which have become accessible due to the growth of the internet, digital media, and data collection tools.
  • Increased Computational Power:
    • GPUs and specialized hardware (e.g., TPUs) allow faster training of deep models.
    • What once took days or weeks can now be trained in hours or days.
  • Algorithmic Innovations:
    • Advancements such as batch normalization, dropout, and better weight initialization have improved the stability and efficiency of training.
  • Plateau of Conventional Machine Learning Algorithms:
    • Traditional algorithms often fail to improve after a certain data or model complexity threshold.
    • Deep learning continues to scale with data size, improving performance as more data becomes available.

In short:

The success of deep learning is due to advancements in the field, the availability of large datasets, and powerful computation.

CNNs (supervised tasks) vs. Traditional Neural Networks:

  • CNNs are considered deep neural networks. They are specialized architectures designed for tasks involving grid-like data, such as images and videos. Unlike traditional fully connected networks, CNNs utilize convolutions to detect spatial hierarchies in data. This enables them to efficiently capture patterns and relationships, making them better suited for image and visual data processing.

Key Features of CNNs:
    • Input as Images: CNNs directly take images as input, which allows them to process raw pixel data without extensive manual feature engineering.
    • Efficiency in Training: By leveraging properties such as local receptive fields, parameter sharing, and spatial hierarchies, CNNs make the training process computationally efficient compared to fully connected networks.
    • Applications: CNNs excel at solving problems in image recognition, object detection, segmentation, and other computer vision tasks.
[Figure: CNN architecture]

Convolutional Layers and ReLU

  • Convolutional Layers apply a filter (kernel) over the input data to extract features such as edges, textures, or patterns.
  • After the convolution operation, the resulting feature map undergoes a non-linear activation function, commonly ReLU (Rectified Linear Unit).

Pooling Layers

Pooling layers are used to down-sample the feature map, reducing its size while retaining important features.

Why Use Pooling?

  • Reduces dimensionality, decreasing computation and preventing overfitting.
  • Keeps the most significant information from the feature map.
  • Increases the spatial invariance of the network (i.e., the ability to recognize patterns regardless of location).

Keras Code

One Set of Convolutional and Pooling Layers:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Input(shape=(28, 28, 1)))
model.add(Conv2D(16, (5, 5), strides=(1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


Two Sets of Convolutional and Pooling Layers:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Input(shape=(28, 28, 1)))
model.add(Conv2D(16, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model.add(Conv2D(8, (2, 2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Recurrent Neural Networks (supervised tasks)

  • Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle sequential data. They have loops that allow information to persist, meaning they don't just take the current input at each time step but also consider the output (or hidden state) from the previous time step. This enables them to process data where the order and context of inputs matter, such as time series, text, or audio.

How RNNs Work:

[Figure: RNN cell with recurrent feedback loop]

  1. Sequential Data Processing:
    • At each time step t, the network takes:
      • The current input x_t,
      • A bias b_t,
      • And the hidden state (output) from the previous time step, a_{t-1}.
    • These are combined to compute the current hidden state a_t.
  2. Recurrent Connection:
    • The feedback loop, as shown in the diagram, allows information to persist across time steps. This is the "memory" mechanism of RNNs.
    • The connection between a_{t-1} (previous state) and z_t (current computation) captures temporal dependencies in sequential data.
  3. Activation Function:
    • The raw output z_t passes through a non-linear activation function f (commonly tanh or ReLU) to compute a_t, the current output or hidden state (see the sketch below).
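Putting the steps together, one recurrent step computes z_t = W_x·x_t + W_a·a_{t-1} + b and a_t = f(z_t). Below is a minimal sketch of a single step in plain TensorFlow; the dimensions, random weights, and the tanh choice are illustrative assumptions, not values from these notes.

import tensorflow as tf

# Illustrative dimensions (assumptions)
input_dim, hidden_dim = 8, 16

W_x = tf.random.normal((input_dim, hidden_dim))   # input-to-hidden weights
W_a = tf.random.normal((hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b = tf.zeros((hidden_dim,))                       # bias

def rnn_step(x_t, a_prev):
    # Combine the current input with the previous hidden state
    z_t = tf.matmul(x_t, W_x) + tf.matmul(a_prev, W_a) + b
    # Non-linear activation produces the new hidden state a_t
    return tf.tanh(z_t)

x_t = tf.random.normal((1, input_dim))  # input at time step t (batch of 1)
a_prev = tf.zeros((1, hidden_dim))      # initial hidden state
a_t = rnn_step(x_t, a_prev)
print(a_t.shape)  # (1, 16)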

Long Short-Term Memory (LSTM) Model

A popular RNN variant.

LSTMs are a type of Recurrent Neural Network (RNN) that solve the vanishing gradient problem, allowing them to capture long-term dependencies in sequential data. They achieve this through a memory cell and a system of gates that regulate the flow of information.


Applications Include:

  1. Image Generation:
    • LSTMs generate images pixel by pixel or feature by feature.
    • Often combined with Convolutional Neural Networks (CNNs) for spatial feature extraction.
  2. Handwriting Generation:
    • Learn sequences of strokes and predict future strokes to generate realistic handwriting.
    • Can mimic human writing styles given text input.
  3. Automatic Captioning for Images and Videos:
    • Combines CNNs (for extracting visual features) with LSTMs (for generating captions).
    • Example: "A group of people playing soccer in a field."

Key Features of LSTMs

  1. Memory Cell:
    • Stores information for long durations, addressing the short-term memory limitations of standard RNNs.
  2. Gating Mechanism:
    • Forget Gate: Decides which parts of the memory to discard.
    • Input Gate: Determines what new information to add to the memory.
    • Output Gate: Controls what part of the memory is output at the current step.
  3. Handles Long-Term Dependencies:
    • LSTMs can maintain context over hundreds of time steps, making them suitable for tasks like text generation, speech synthesis, and video processing.
  4. Flexible Sequence Modeling:
    • Suitable for variable-length input sequences, such as sentences, time-series data, or video frames.

Additional Notes

  • Variants:
    • GRUs (Gated Recurrent Units): A simplified version of LSTMs with fewer parameters, often faster to train but sometimes less effective for longer dependencies.
    • Bi-directional LSTMs: Process sequences in both forward and backward directions, capturing past and future context simultaneously.
  • Use Cases Beyond the Above:
    • Speech Recognition: Transcribe spoken words into text.
    • Music Generation: Compose music by predicting the sequence of notes.
    • Anomaly Detection: Detect unusual patterns in time-series data, like network intrusion or equipment failure.
  • Challenges:
    • LSTMs are computationally more expensive compared to simpler RNNs.
    • They require careful hyperparameter tuning (e.g., learning rate, number of layers, units per layer).
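As a minimal usage sketch, a Keras LSTM for sequence classification might look like the following; the sequence length, feature width, and layer sizes are illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

model = Sequential([
    Input(shape=(100, 8)),          # 100 time steps, 8 features per step (assumed)
    LSTM(64),                       # memory cell + gates capture long-term dependencies
    Dense(1, activation='sigmoid')  # e.g., binary classification of the whole sequence
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()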

Autoencoders (unsupervised learning)

[Figure: autoencoder encoder-bottleneck-decoder structure]

Autoencoding is a type of data compression algorithm where the compression (encoding) and decompression (decoding) functions are learned automatically from the data, typically using neural networks.

Key Features of Autoencoders

  1. Unsupervised Learning:
    • Autoencoders are unsupervised neural network models.
    • They use backpropagation, but the input itself acts as the output label during training.
  2. Data-Specific:
    • Autoencoders are specialized for the type of data they are trained on and may not generalize well to data that is significantly different.
  3. Nonlinear Transformations:
    • Autoencoders can learn nonlinear representations, which makes them more powerful than techniques like Principal Component Analysis (PCA) that can only handle linear transformations.
  4. Applications:
    • Data Denoising: Removing noise from images, audio, or other data.
    • Dimensionality Reduction: Reducing the number of features in data while preserving key information, often used for visualization or preprocessing.

Types of Autoencoders

  • Standard Autoencoders:
    • Composed of an encoder-decoder structure with a bottleneck layer for compressed representation.
  • Denoising Autoencoders:
    • Designed to reconstruct clean data from noisy inputs.
  • Variational Autoencoders (VAEs):
    • A probabilistic extension of autoencoders that can generate new data points similar to the training data.
  • Sparse Autoencoders:
    • Include a sparsity constraint to learn compressed representations efficiently.

Clarifications on Specific Points

  1. Restricted Boltzmann Machines (RBMs):
    • While RBMs are related to autoencoders, they are not the same.
    • RBMs are generative models used for pretraining deep networks, and they are not autoencoders.
    • A more closely related concept is Deep Belief Networks (DBNs), which combine stacked RBMs.
  2. Fixing Imbalanced Datasets:
    • While autoencoders can be used to oversample minority classes by learning the structure of the minority class and generating synthetic samples, this is not their primary application.

Applications of Autoencoders

  1. Data Denoising:
    • Remove noise from images, audio, or other forms of data.
  2. Dimensionality Reduction:
    • Learn compressed representations for data visualization or feature extraction.
  3. Estimating Missing Values:
    • Reconstruct missing data by leveraging patterns in the dataset.
  4. Automatic Feature Extraction:
    • Extract meaningful features from unstructured data like text, images, or time-series data.
  5. Anomaly Detection:
    • Learn normal patterns in the data and identify deviations, useful for fraud detection, network intrusion, or manufacturing defect identification.
  6. Data Generation:
    • Variational autoencoders can generate new, similar data points, such as synthetic images.
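A minimal Keras sketch of a standard autoencoder on flattened 28×28 inputs; the 32-unit bottleneck and layer sizes are illustrative assumptions.

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

inputs = Input(shape=(784,))
encoded = Dense(32, activation='relu')(inputs)       # bottleneck: compressed representation
decoded = Dense(784, activation='sigmoid')(encoded)  # reconstruction of the input

autoencoder = Model(inputs, decoded)
# The input itself acts as the training target (unsupervised)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()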

Keras Advanced Features

  • Sequential API: Used for simple, linear stacks of layers.
  • Functional API: Designed for more flexibility and control, enabling the creation of intricate models such as:
    • Models with multiple inputs and outputs.
    • Shared layers that can be reused.
    • Models with non-sequential data flows, like multi-branch or residual networks.

Advantages of the Functional API:

  1. Flexibility: Enables construction of complex architectures like multi-branch and hierarchical models.
  2. Clarity: Provides an explicit and clear representation of the model structure.
  3. Reusability: Layers or models can be reused across different parts of the architecture.

Real-World Applications:

  1. Healthcare:
    • Medical image analysis for disease detection (e.g., pneumonia detection from chest X-rays).
  2. Finance:
    • Predicting market trends using time-series data.
  3. Autonomous Driving:
    • Object detection for pedestrians or vehicles.
    • Lane detection for road safety systems.
Keras Functional API and Subclassing API

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Define the input
inputs = Input(shape=(784,))

# Define the layers
x = Dense(64, activation='relu')(inputs)
outputs = Dense(10, activation='softmax')(x)

# Create the model
model = Model(inputs=inputs, outputs=outputs)
model.summary()

Handling multiple inputs and outputs

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, concatenate

# Define two sets of inputs
inputA = Input(shape=(64,))
inputB = Input(shape=(128,))

# The first branch operates on the first input
x = Dense(8, activation='relu')(inputA)
x = Dense(4, activation='relu')(x)

# The second branch operates on the second input
y = Dense(16, activation='relu')(inputB)
y = Dense(4, activation='relu')(y)

# Combine the outputs of the two branches
combined = concatenate([x, y])

# Add fully connected (FC) layers and a regression output
z = Dense(2, activation='relu')(combined)
z = Dense(1, activation='linear')(z)

# Create the model
model = Model(inputs=[inputA, inputB], outputs=z)
model.summary()

Shared layers and complex architectures

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Define the input layer
input = Input(shape=(28, 28, 1))

# Define a shared layer (a Dense layer here; a convolutional base is shared the same way)
shared_layer = Dense(64, activation='relu')

# Process the input through the shared layer twice (weights are reused)
processed_1 = shared_layer(input)
processed_2 = shared_layer(input)

# Create a model using the shared layer
model = Model(inputs=input, outputs=[processed_1, processed_2])
model.summary()

Practical example: Implementing a complex model

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten, concatenate

# First input model
inputA = Input(shape=(32, 32, 1))
x = Conv2D(32, (3, 3), activation='relu')(inputA)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Model(inputs=inputA, outputs=x)

# Second input model
inputB = Input(shape=(32, 32, 1))
y = Conv2D(32, (3, 3), activation='relu')(inputB)
y = MaxPooling2D((2, 2))(y)
y = Flatten()(y)
y = Model(inputs=inputB, outputs=y)

# Combine the outputs of the two branches
combined = concatenate([x.output, y.output])

# Add fully connected (FC) layers and a regression output
z = Dense(64, activation='relu')(combined)
z = Dense(1, activation='linear')(z)

# Create the model
model = Model(inputs=[x.input, y.input], outputs=z)
model.summary()

Keras subclassing API
  • Offers flexibility
  • Defines custom and dynamic models
  • Works by subclassing the Model class and implementing the call method
  • Used in research and development for custom training loops and non-standard architectures.

import tensorflow as tf

# Define your model by subclassing
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        # Define layers
        self.dense1 = tf.keras.layers.Dense(64, activation='relu')
        self.dense2 = tf.keras.layers.Dense(10, activation='softmax')

    def call(self, inputs):
        # Forward pass
        x = self.dense1(inputs)
        return self.dense2(x)

# Instantiate the model
model = MyModel()
model.build(input_shape=(None, 784))  # Build the model so summary() can run
model.summary()

# Define loss function and optimizer
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

# Compile the model
model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

Use Cases for the Keras Subclassing API:

  1. Models with Dynamic Architecture:
    • The Subclassing API is ideal for creating models with architectures that vary dynamically based on the input or intermediate computations.
    • Example: Recurrent neural networks (RNNs) with variable sequence lengths or architectures that depend on runtime conditions.
  2. Custom Training Loops:
    • The Subclassing API offers fine-grained control over the training process, allowing you to implement custom training loops using GradientTape (see the sketch after this list).
    • Example: Adversarial training (e.g., GANs) or reinforcement learning models where standard .fit() may not suffice.
  3. Research and Prototyping:
    • It is particularly suited for rapid experimentation in research scenarios, enabling you to define and test novel model architectures.
    • Example: Exploring new layer types, loss functions, or combining multiple types of inputs and outputs.
  4. Dynamic Graphs:
    • The Subclassing API allows you to implement eager execution with dynamic computation graphs, offering flexibility to perform different operations at each step of the forward pass.
    • Example: Models that use conditional logic or operations like if/else during the forward pass, such as graph neural networks or decision-tree-inspired neural networks.
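A minimal sketch of a custom training loop with GradientTape, reusing the MyModel class defined above; the random data and two-epoch loop are illustrative assumptions.

import tensorflow as tf

# Illustrative random data (assumption): 784-dim inputs, 10 classes
x_train = tf.random.normal((256, 784))
y_train = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

model = MyModel()  # the subclassed model from the example above
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

for epoch in range(2):
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            predictions = model(x_batch)          # forward pass, recorded on the tape
            loss = loss_fn(y_batch, predictions)  # scalar loss for this batch
        # Backward pass: gradients of the loss w.r.t. the trainable weights
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(f"Epoch {epoch + 1}: loss = {loss.numpy():.4f}")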


Dropout Layers

Dropout is a regularization technique that randomly zeroes out a fraction of neuron outputs during each training step. This reduces the network’s reliance on specific neurons and fosters more robust feature learning—helping mitigate overfitting.

Training vs. Inference

  • Dropout is active only during training, where it stochastically drops neurons.
  • At inference (testing/prediction), dropout is effectively disabled (or rescaled to preserve expected outputs).

Dropout Rate (Hyperparameter)

  • Determines the probability of zeroing out each neuron’s output.
  • Common values range from 0.2 to 0.5, but it’s dataset and architecture dependent.

TensorFlow Keras Example

from tensorflow.keras.layers import Dropout

# Apply 50% dropout

dropout_layer = Dropout(rate=0.5)(hidden_layer)

Overall, dropout promotes generalization by preventing co-adaptation of neurons. It is an easy-to-use, widely adopted method for combating overfitting in deep networks.

Summary

  • Dropout randomly “turns off” neurons during training.
  • It is disabled at inference, ensuring consistent outputs.
  • The dropout rate sets the fraction of neurons dropped (typical ranges: 0.2–0.5).
  • TensorFlow (and other frameworks) make it straightforward to implement.

Batch Normalization

Batch Normalization (BatchNorm) stabilizes and accelerates neural network training by normalizing activations across the current mini-batch. This reduces internal covariate shift—i.e., shifts in input distributions over time—and often enables using larger learning rates for faster convergence.

  • Operation
    • Normalizes each channel to have near-zero mean and unit variance (based on the current mini-batch statistics).
    • During training, BatchNorm uses batch statistics; during inference, it uses running averages of mean and variance.
  • Learnable Scale & Shift
    • Each BatchNorm layer introduces two trainable parameters: γ (scale) and β (shift). These parameters allow the model to adjust or “undo” normalization if needed, preserving representational power.
  • TensorFlow Keras Example

from tensorflow.keras.layers import BatchNormalization

batch_norm_layer = BatchNormalization()(hidden_layer)


Overall, BatchNorm often yields faster training, improved stability, and can help networks converge in fewer epochs.

  • Summary
    • BatchNorm normalizes activations, reducing internal covariate shift.
    • It’s used both in training (via batch statistics) and inference (via running averages).
    • Introduces two learnable parameters for scaling and shifting post-normalization.
    • Can allow higher learning rates and faster convergence.

Keras Custom Layers

Custom layers allow developers to extend the functionality of deep learning frameworks and tailor models to specific needs:

  • Novel Research Ideas
    • Integrate new algorithms or experimental methods.
    • Rapidly prototype and evaluate cutting-edge techniques.
  • Performance Optimization
    • Fine-tune execution for specialized hardware or data structures.
    • Reduce memory or computational overhead by customizing layer operations.
  • Flexibility
    • Define unique behaviors not covered by standard library layers.
    • Experiment with unconventional architectures or parameterization.
  • Reusability & Maintenance
    • Bundle functionality into modular, readable components.
    • Simplify debugging and collaboration by centralizing custom code.
Basics of creating a custom layer

import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyCustomLayer(Layer):
    def __init__(self, units=32, **kwargs):
        super(MyCustomLayer, self).__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        # Define trainable weights (e.g., kernel and bias)
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='random_normal',
            trainable=True
        )
        self.b = self.add_weight(
            shape=(self.units,),
            initializer='zeros',
            trainable=True
        )

    def call(self, inputs):
        # Forward pass using the custom weight parameters
        return tf.matmul(inputs, self.w) + self.b

# Eager execution demo
print(tf.executing_eagerly())  # Should return True in TF 2.x

a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])
result = tf.add(a, b)
print(result)
# Outputs: tf.Tensor([5 7 9], shape=(3,), dtype=int32)

Custom Dense Layer

class CustomDenseLayer(Layer):
    def __init__(self, units=32):
        super(CustomDenseLayer, self).__init__()
        self.units = units

    def build(self, input_shape):
        # Define trainable weights (e.g., kernel and bias)
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='random_normal',
            trainable=True
        )
        self.b = self.add_weight(
            shape=(self.units,),
            initializer='zeros',
            trainable=True
        )

    def call(self, inputs):
        # Forward pass with a ReLU activation
        return tf.nn.relu(tf.matmul(inputs, self.w) + self.b)

from tensorflow.keras.models import Sequential

model = Sequential([
    CustomDenseLayer(64),
    CustomDenseLayer(10)
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

TensorFlow 2.x

  • Eager Execution
    • Executes operations immediately rather than building a static graph.
    • Makes TensorFlow more intuitive and Pythonic, enabling straightforward debugging.

  • Immediate Feedback
    • Facilitates interactive programming
    • Simplified Code
  • High-Level APIs
    • Simplifies model building via Keras, offering an approachable, layer-based API.
      • User Friendly
      • Modular and composable
      • Extensive documentation
    • Integrates seamlessly with eager execution to facilitate rapid experimentation.
  • Cross-Platform Support
    • Compatible with various hardware backends (e.g., CPU, GPU, TPU).
    • Embedded devices (TensorFlow Lite)
    • Web (TensorFlow.js)
    • Production ML Pipelines (TensorFlow Extended (TFX))
  • Scalability & Performance
    • Optimizations under the hood handle large-scale training.
    • Suited to both research prototyping and production systems.
  • Rich Ecosystem
    • Extends TensorFlow with additional libraries (e.g., TensorFlow Extended, TensorFlow Lite, TensorFlow.js).
    • Offers a broad set of tools for data processing, visualization, and distributed training.
    • TensorFlow Hub
      • Repository for reusable machine learning modules.
    • TensorBoard
      • Visualization toolkit for TensorFlow

Convolutional Neural Networks (CNNs)

Purpose: CNNs analyze visual data by leveraging patterns in the spatial structure of images, mimicking the hierarchical structure of the human visual system.

Key Components of CNNs:

  1. Convolutional Layers
    • Perform feature extraction.
    • Learn spatial hierarchies of features through kernels (filters) applied to the input data.
    • Generate feature maps by convolving the input with a set of filters.
  2. Pooling Layers
    • Perform downsampling to reduce the spatial dimensions of feature maps.
    • Types:
      • Max Pooling: Retains the maximum value in a pooling window.
      • Average Pooling: Retains the average value in a pooling window.
    • Reduces computational complexity and prevents overfitting by acting as a form of regularization.
  3. Fully Connected (FC) Layers
    • Flatten the high-level features from convolutional and pooling layers into a single vector.
    • Use the vector for classification or regression tasks by passing it through dense layers and applying an activation function like softmax or sigmoid.

Workflow of CNNs:

  1. Input Image
    • Raw pixel data of the image is input into the network.
  2. Convolutional Layers
    • Extract low-level features like edges, corners, and textures. With deeper layers, higher-level features like shapes and objects are learned.
  3. Activation Function
    • Non-linear activation functions (e.g., ReLU) are applied after convolution to introduce non-linearity and enable the network to learn complex patterns.
  4. Pooling Layers
    • Reduce dimensionality while retaining key features. This decreases computational costs and improves model generalization.
  5. Fully Connected Layers
    • Use learned features to make final predictions. Typically followed by a softmax or sigmoid activation for classification.

Expanded CNN Architecture Example:

  • Input -> Convolution -> ReLU -> Pooling -> Convolution -> ReLU -> Pooling -> Fully Connected -> Output.

Additional Notes for Understanding:

  1. Padding
    • Used to preserve spatial dimensions after convolution. Common padding types are:
      • Valid Padding: No padding; reduces output size.
      • Same Padding: Pads input so that output dimensions match the input dimensions.
  2. Stride
    • The number of pixels by which the filter moves during convolution. Larger strides reduce the spatial size of the output (see the worked example after this list).
  3. Regularization Techniques
    • Dropout: Randomly sets a fraction of input units to zero during training to prevent overfitting.
    • Batch Normalization: Normalizes intermediate activations to improve stability and training speed.
  4. Data Augmentation
    • Techniques like flipping, rotating, cropping, and scaling are used to increase the diversity of the training data and improve generalization.
  5. Applications of CNNs:
    • Image classification.
    • Object detection.
    • Semantic segmentation.
    • Medical imaging.
    • Style transfer.
  6. Popular Architectures:
    • LeNet: Early architecture for handwritten digit recognition.
    • AlexNet: Introduced deeper networks with ReLU and dropout.
    • VGGNet: Deep but simple networks with small filters.
    • ResNet: Introduced residual connections to combat vanishing gradients.
    • Inception: Used modules to increase depth and width efficiently.
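To make padding and stride concrete: the output size of a convolution along one dimension is output = floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride. A small sketch of this arithmetic:

def conv_output_size(n, f, p, s):
    # Standard convolution output-size formula
    return (n + 2 * p - f) // s + 1

# 28x28 input, 5x5 filter
print(conv_output_size(28, 5, p=0, s=1))  # valid padding, stride 1 -> 24
print(conv_output_size(28, 5, p=2, s=1))  # same padding, stride 1  -> 28
print(conv_output_size(28, 5, p=0, s=2))  # valid padding, stride 2 -> 12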


Code Example: Basic CNN:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Create a Sequential model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # First Conv2D layer
    MaxPooling2D((2, 2)),  # First MaxPooling layer
    Conv2D(64, (3, 3), activation='relu'),  # Second Conv2D layer
    MaxPooling2D((2, 2)),  # Second MaxPooling layer
    Flatten(),  # Flatten the feature maps
    Dense(128, activation='relu'),  # Fully connected dense layer
    Dense(10, activation='softmax')  # Output layer for 10 classes
])

# Compile the model
model.compile(optimizer='adam',  # Use Adam optimizer
              loss='categorical_crossentropy',  # Loss function for multi-class classification
              metrics=['accuracy'])  # Metrics to monitor during training

# Print a summary of the model
model.summary()


  • Model Explanation:
    • Conv2D(32, (3, 3)): Adds a convolutional layer with 32 filters of size (3x3).
    • MaxPooling2D((2, 2)): Reduces spatial dimensions by taking the max value in a (2x2) window.
    • Flatten(): Flattens the 2D feature maps into a 1D vector.
    • Dense(128, activation='relu'): Fully connected layer with 128 units and ReLU activation.
    • Dense(10, activation='softmax'): Output layer for classification into 10 classes using softmax activation.
  • Input Shape:
    • input_shape=(64, 64, 3) assumes input images are 64x64 pixels with 3 color channels (RGB).

Advanced CNN Architectures

1. VGG (Visual Geometry Group Networks)

  • Key Idea: Uniform small 3×3 convolutional filters.
    • Max-Pooling Layers
    • Fully Connected Layers
  • Structure: Deep networks (e.g., VGG-16: 13 convolutional + 3 dense layers).
  • Pros: Simplicity and depth for feature extraction.
  • Cons: High computational cost.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Create a Sequential model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # First Conv2D layer
    Conv2D(32, (3, 3), activation='relu'),  # Second Conv2D layer
    MaxPooling2D((2, 2)),  # First MaxPooling layer
    Conv2D(128, (3, 3), activation='relu'),  # Third Conv2D layer
    Conv2D(128, (3, 3), activation='relu'),  # Fourth Conv2D layer
    MaxPooling2D((2, 2)),  # Second MaxPooling layer
    Conv2D(256, (3, 3), activation='relu'),  # Fifth Conv2D layer
    Conv2D(256, (3, 3), activation='relu'),  # Sixth Conv2D layer
    MaxPooling2D((2, 2)),  # Third MaxPooling layer
    Flatten(),  # Flatten the feature maps
    Dense(512, activation='relu'),  # First fully connected dense layer
    Dense(512, activation='relu'),  # Second fully connected dense layer
    Dense(10, activation='softmax')  # Output layer for 10 classes
])

# Compile the model
model.compile(optimizer='adam',  # Use Adam optimizer
              loss='categorical_crossentropy',  # Loss function for multi-class classification
              metrics=['accuracy'])  # Metrics to monitor during training

# Print a summary of the model
model.summary()


Explanation of Key Components:

  1. Conv2D Layers:
    • Filter counts of 32, 128, and 256 represent increasing feature extraction depth.
    • 3×3 kernels are used consistently for feature extraction.
  2. MaxPooling2D Layers:
    • MaxPooling2D((2, 2)) downsamples the feature maps, reducing spatial dimensions and computational cost while preserving important features.
  3. Flatten Layer:
    • Converts the 3D feature maps into a 1D vector for dense layer input.
  4. Dense Layers:
    • Two fully connected layers with 512 neurons each, followed by a softmax layer for 10-class classification.
  5. Compilation:
    • Adam optimizer: Adaptive learning rate optimization for faster convergence.
    • Categorical Crossentropy: Suitable for multi-class classification problems.

2. ResNet (Residual Networks)

  • Key Idea: Residual connections (skip connections) to combat vanishing gradients.
  • Structure: Residual block: Output=F(Input)+Input.
  • Pros: Enables very deep networks (e.g., ResNet-50, ResNet-101).
  • Cons: Increased complexity.

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add, MaxPooling2D, Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Residual Block
def residual_block(x, filters, kernel_size=3, stride=1):
    shortcut = x
    x = Conv2D(filters, kernel_size, strides=stride, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(filters, kernel_size, strides=1, padding='same')(x)
    x = BatchNormalization()(x)
    # Project the shortcut with a 1x1 convolution when the shape changes,
    # so Add() receives tensors of matching dimensions
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = Conv2D(filters, 1, strides=stride, padding='same')(shortcut)
        shortcut = BatchNormalization()(shortcut)
    x = Add()([x, shortcut])  # Residual connection: Output = F(Input) + Input
    x = Activation('relu')(x)
    return x

# ResNet-like Model
def resnet_like(input_shape=(64, 64, 3), num_classes=10):
    inputs = Input(shape=input_shape)

    # Initial Conv Layer
    x = Conv2D(64, (7, 7), strides=2, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D((3, 3), strides=2, padding='same')(x)

    # Residual Blocks
    x = residual_block(x, 64)
    x = residual_block(x, 64)
    x = residual_block(x, 128, stride=2)
    x = residual_block(x, 128)
    x = residual_block(x, 256, stride=2)
    x = residual_block(x, 256)

    # Global Average Pooling and Output
    x = GlobalAveragePooling2D()(x)
    x = Dense(num_classes, activation='softmax')(x)

    # Create Model
    model = Model(inputs, x)
    return model

# Create and compile the ResNet-like model
model = resnet_like()
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Print the model summary
model.summary()


3. Inception Networks

  • Key Idea: Multi-scale feature extraction with filters of different sizes (e.g., 1×1, 3×3, 5×5).
  • Structure: Inception modules use 1×1 bottleneck layers to reduce dimensionality.
  • Pros: Efficient and versatile.
  • Cons: Complex architecture design.
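A minimal sketch of a single Inception-style module; the filter counts and input shape are illustrative assumptions, not values from a specific paper.

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, concatenate
from tensorflow.keras.models import Model

inputs = Input(shape=(64, 64, 3))

# Parallel branches at multiple scales, with 1x1 bottlenecks to cut dimensionality
branch1 = Conv2D(16, (1, 1), padding='same', activation='relu')(inputs)

branch3 = Conv2D(16, (1, 1), padding='same', activation='relu')(inputs)  # 1x1 bottleneck
branch3 = Conv2D(24, (3, 3), padding='same', activation='relu')(branch3)

branch5 = Conv2D(8, (1, 1), padding='same', activation='relu')(inputs)   # 1x1 bottleneck
branch5 = Conv2D(12, (5, 5), padding='same', activation='relu')(branch5)

pool = MaxPooling2D((3, 3), strides=1, padding='same')(inputs)
pool = Conv2D(12, (1, 1), padding='same', activation='relu')(pool)

# Concatenate the multi-scale feature maps along the channel axis
outputs = concatenate([branch1, branch3, branch5, pool])

model = Model(inputs, outputs)
model.summary()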

Data Augmentation Techniques

Purpose:

  • Enhance the robustness and generalization of models.
  • Artificially expand the training dataset by introducing variations.
  • Reduce overfitting by exposing the model to diverse examples.
  • Improve performance on unseen data by mimicking real-world variations.

Common Data Augmentation Techniques

  1. Rotation:
    • Rotate images by a random degree within a specified range (e.g., ±30°).
    • Helps the model learn invariance to orientation changes.
  2. Translations:
    • Shift the image horizontally or vertically by a certain number of pixels.
    • Useful for objects that might not always be centered in the frame.
  3. Flipping:
    • Horizontal flipping: Mirrors the image along the vertical axis.
    • Vertical flipping (less common): Mirrors the image along the horizontal axis.
    • Effective for symmetrical objects or scenes.
  4. Scaling:
    • Resize the image by a random factor, either zooming in or out.
    • Helps the model learn invariance to object sizes.
  5. Adding Noise:
    • Inject random noise (e.g., Gaussian noise, salt-and-pepper noise) to simulate variations in image quality.
    • Helps the model become robust to noisy or low-quality inputs.

Other Useful Techniques:

  1. Shearing:
    • Distort the image by slanting it along an axis.
    • Simulates different perspectives or viewing angles.
  2. Cropping:
    • Randomly crop parts of the image to simulate zoomed-in views or partial occlusions.
  3. Brightness, Contrast, and Color Adjustments:
    • Alter the brightness, contrast, saturation, or hue of the image.
    • Makes the model robust to varying lighting conditions.
  4. Random Erasing:
    • Randomly mask out parts of the image.
    • Forces the model to rely on other parts of the image for prediction.
  5. CutMix and MixUp:
    • CutMix: Combines parts of two images and mixes their labels.
    • MixUp: Mixes two images and their labels linearly (see the sketch below).
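A minimal NumPy sketch of MixUp; the array shapes, one-hot labels, and alpha value are illustrative assumptions.

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Sample the mixing coefficient from a Beta distribution
    lam = np.random.beta(alpha, alpha)
    # Linearly mix both the images and their labels
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

# Example with two random "images" and one-hot labels (illustrative)
img_a, img_b = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b)
print(mixed_lab)  # e.g., [0.83 0.17]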

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create an ImageDataGenerator instance with augmentation techniques
datagen = ImageDataGenerator(
    rotation_range=30,       # Rotate images by up to 30 degrees
    width_shift_range=0.2,   # Translate images horizontally by 20% of width
    height_shift_range=0.2,  # Translate images vertically by 20% of height
    horizontal_flip=True,    # Randomly flip images horizontally
    zoom_range=0.2,          # Zoom in/out by up to 20%
    brightness_range=[0.8, 1.2],  # Adjust brightness (80% to 120%)
    fill_mode='nearest'      # Fill missing pixels after transformations
)

# Example: Applying augmentations to a single image
from tensorflow.keras.preprocessing.image import img_to_array, load_img, array_to_img
import matplotlib.pyplot as plt

image = load_img('example.jpg', target_size=(64, 64))  # Load image
image_array = img_to_array(image)  # Convert to array
image_array = image_array.reshape((1,) + image_array.shape)  # Add batch dimension

# Generate and display a few augmented images
i = 0
for batch in datagen.flow(image_array, batch_size=1):
    plt.figure(i)
    plt.imshow(array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()


ImageDataGenerator augments images by rotating, shifting, and flipping them.

The fill_mode parameter in the ImageDataGenerator class specifies how the pixels outside the boundaries of an image are handled when transformations like rotation, shifting, or zooming are applied. These transformations can result in some parts of the image being "pushed out" of the frame, leaving empty areas. The fill_mode determines how these empty areas are filled.

The featurewise_center option in ImageDataGenerator performs feature-wise mean centering on the input data. When enabled, it normalizes the dataset by subtracting the mean value of each feature (e.g., each pixel across all images in the dataset) from the corresponding feature value in the images. (To set the mean of the dataset to 0)


Feature-Wise Normalization

  • Definition: Normalize the entire dataset feature-by-feature, based on the dataset's mean and standard deviation.
  • Use Case: Ensures all features have similar ranges, which helps models converge faster during training.
  • How It Works:
    • Compute the mean and standard deviation for each feature (e.g., pixel intensity) across all images in the dataset.
    • Subtract the mean and divide by the standard deviation for each feature.

datagen = ImageDataGenerator(featurewise_center=True, featurewise_std_normalization=True)

# Fit the generator on the dataset to compute statistics
datagen.fit(training_data)  # training_data is a NumPy array of images


Sample-Wise Normalization

  • Definition: Normalize each individual sample (image) independently by subtracting the sample's mean and dividing by its standard deviation.
  • Use Case: Useful when you want to normalize each image separately, especially if the dataset contains images with different intensity distributions.
  • How It Works:
    • For each image, compute its mean and standard deviation.
    • Normalize the image by subtracting its mean and dividing by its standard deviation.

datagen = ImageDataGenerator(samplewise_center=True, samplewise_std_normalization=True)

# Use the generator as part of the training process
datagen.flow(training_data, training_labels)


# Fit a generator instance on the training images (required for featurewise statistics),
# then visualize a few augmented batches
datagen.fit(training_images)

i = 0
for batch in datagen.flow(training_images, batch_size=32):
    plt.figure(i)
    plt.imshow(array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()


Custom Augmentation Functions

  • Definition: Define your own augmentation logic when the built-in transformations are insufficient or you need specialized augmentations.
  • How It Works:
    • Write a custom function that modifies the input data (e.g., applying specific transformations or injecting custom noise).
    • Pass the function as part of a data preprocessing pipeline.

import numpy as np

# Define a custom function to add random noise
def add_noise(image):
    noise = np.random.normal(loc=0, scale=0.1, size=image.shape)  # Gaussian noise
    return np.clip(image + noise, 0, 1)  # Ensure pixel values remain in [0, 1]

# Create a DataGenerator with a custom preprocessing function
datagen = ImageDataGenerator(preprocessing_function=add_noise)

# Use the generator and display a few augmented batches
i = 0
for batch in datagen.flow(training_images, batch_size=32):
    plt.figure(i)
    plt.imshow(array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()

Transfer Learning in Keras

import sys
import os
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from PIL import Image

# Recursion limit workaround for very deep networks (note: 1000 is Python's default)
sys.setrecursionlimit(1000)

# Ensure the dataset directory exists and generate sample data if needed
def generate_sample_data():
    os.makedirs('training_data/class_1', exist_ok=True)
    os.makedirs('training_data/class_2', exist_ok=True)
    for i in range(10):
        img = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
        img.save(f'training_data/class_1/img_{i}.jpg')
        img.save(f'training_data/class_2/img_{i}.jpg')

# Generate sample data (uncomment if needed)
# generate_sample_data()

# Load the VGG16 model pre-trained on ImageNet (without the top layers)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# What include_top=False does:
# - Removes the fully connected (Dense) layers from the original pretrained model.
# - Keeps only the convolutional base (feature extractor).
# - Allows customization by letting you add your own fully connected layers for classification.

# Freeze the base model layers
for layer in base_model.layers:
    layer.trainable = False

# Create a new model and add the base model and new layers
model = Sequential([
    base_model,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')  # Change this for multi-class classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

# Load and preprocess the dataset
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    'training_data',
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary'
)

# Train the model
model.fit(train_generator, epochs=10)

Fine-Tuning the Pre-trained Model VGG16

# Unfreeze the top layers of the base model
for layer in base_model.layers[-4:]:
    layer.trainable = True

# Compile the model again
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model again
model.fit(train_generator, epochs=10)

Using Pre-trained models

Using a pre-trained model for feature extraction.


import os
from PIL import Image
import numpy as np

# Define the base directory for sample data
base_dir = 'sample_data'
class1_dir = os.path.join(base_dir, 'class1')
class2_dir = os.path.join(base_dir, 'class2')

# Create directories for two classes
os.makedirs(class1_dir, exist_ok=True)
os.makedirs(class2_dir, exist_ok=True)

# Function to generate and save random images
def generate_random_images(save_dir, num_images):
    for i in range(num_images):
        # Generate a random RGB image of size 224x224
        img_array = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
        img = Image.fromarray(img_array)
        img.save(os.path.join(save_dir, f"img_{i}.jpg"))

# Generate sample data for class1 and class2
generate_random_images(class1_dir, 100)
generate_random_images(class2_dir, 100)

print("Sample images generated successfully!")

Example: Pretrained model for extracting features

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

# Load the VGG16 model pre-trained on ImageNet
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze all layers initially
for layer in base_model.layers:
    layer.trainable = False

# Create a new model and add the base model and new layers
model = Sequential([
    base_model,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(1, activation='sigmoid')  # Change for multi-class classification if needed
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    '/content/sample_data',  # Update with your dataset path
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary'  # Use 'categorical' for multi-class classification
)

# Train the model with frozen layers
model.fit(train_generator, epochs=10)

# Gradually unfreeze layers and fine-tune
for layer in base_model.layers[-4:]:  # Unfreeze the last 4 layers
    layer.trainable = True

# Compile the model again with a lower learning rate for fine-tuning
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model again for fine-tuning
model.fit(train_generator, epochs=10)  # Fine-tune for additional epochs

# Modify data generator to include validation data
train_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = train_datagen.flow_from_directory(
    'sample_data',
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    subset='training'
)

validation_generator = train_datagen.flow_from_directory(
    'sample_data',
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    subset='validation'
)

# Train the model with validation data
history = model.fit(train_generator, epochs=10, validation_data=validation_generator)

# Plot training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

TensorFlow: Image Manipulation Tasks

  • Classification
  • Data augmentation


import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array, load_img
import matplotlib.pyplot as plt

# Load and preprocess the image
img = load_img('/content/path_to_image.jpg', target_size=(224, 224))  # Load image with resizing
img_array = img_to_array(img)  # Convert image to NumPy array
img_array = tf.expand_dims(img_array, 0)  # Add batch dimension

# Display the original image
plt.imshow(img)  # Show image
plt.show()

Data Augmentation techniques

from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img, array_to_img
import matplotlib.pyplot as plt

# Define ImageDataGenerator with augmentation parameters
datagen = ImageDataGenerator(
    rotation_range=40,          # Rotate images up to 40 degrees
    width_shift_range=0.2,      # Shift horizontally by 20% of width
    height_shift_range=0.2,     # Shift vertically by 20% of height
    shear_range=0.2,            # Shear transformation
    zoom_range=0.2,             # Zoom in/out by 20%
    horizontal_flip=True,       # Flip images horizontally
    fill_mode='nearest'         # Fill missing pixels with nearest values
)

# Load the image
img = load_img('/content/path_to_image.jpg')  # Update with actual image path
x = img_to_array(img)  # Convert image to array
x = x.reshape((1,) + x.shape)  # Add batch dimension

# Generate augmented images and display them
i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.figure(i)
    plt.imshow(array_to_img(batch[0]))  # Convert back to image format
    i += 1
    if i % 4 == 0:
        break
plt.show()

Transpose Convolution

Transpose convolution (also called deconvolution or up-convolution) is used in deep learning to increase the spatial dimensions of a feature map. It is commonly used in tasks like:

  • Image Generation (e.g., GANs - Generative Adversarial Networks)
  • Super-Resolution (enhancing image resolution)
  • Semantic Segmentation (assigning pixel-wise labels to images, e.g., U-Net)

How Transpose Convolution Works

  1. Inserts zeros between input pixels
    • Expands the input feature map without learning new information.
  2. Applies a standard convolution operation
    • Uses a kernel (filter) to learn features and generate an upsampled output.
  3. Upsamples the feature map
    • The output size is larger than the input size, increasing spatial resolution.

Comparison: Normal vs. Transpose Convolution

  Convolution (Downsampling)                | Transpose Convolution (Upsampling)
  ------------------------------------------|-------------------------------------------
  Reduces spatial dimensions                | Increases spatial dimensions
  Extracts features                         | Reconstructs spatial details
  Used in CNN encoder (feature extraction)  | Used in CNN decoder (image reconstruction)

Use Cases of Transpose Convolution

  • Generative Adversarial Networks (GANs): used in DCGANs to generate high-resolution images.
  • Super-Resolution (SRGANs, ESRGANs): enhances the quality of low-resolution images.
  • Semantic Segmentation (U-Net, DeepLabV3+): converts feature maps back to full-resolution pixel-wise masks.
  • Autoencoders (Variational Autoencoders - VAEs): used in decoder networks to reconstruct images.

import os
import logging
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2DTranspose

# Set environment variables to suppress TensorFlow warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Ignore INFO, WARNING, and ERROR messages
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'  # Turn off oneDNN custom operations

# Use logging to suppress TensorFlow warnings
logging.getLogger('tensorflow').setLevel(logging.ERROR)

# Define the input layer
input_layer = Input(shape=(28, 28, 1))  # Example: grayscale image input

# Add a transpose convolution layer (upsampling)
transpose_conv_layer = Conv2DTranspose(
    filters=32,
    kernel_size=(3, 3),
    strides=(2, 2),  # Upsamples the spatial dimensions
    padding='same',
    activation='relu'
)(input_layer)

# Define the output layer (final upsampling)
output_layer = Conv2DTranspose(
    filters=1,
    kernel_size=(3, 3),
    activation='sigmoid',
    padding='same'
)(transpose_conv_layer)

# Create the model
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# Print model summary
model.summary()


Issues with Transpose Convolution & How to Mitigate Them

When using Transpose Convolution (Conv2DTranspose), certain artifacts and issues can occur, particularly checkerboard artifacts caused by uneven kernel overlaps.

Issues in Transpose Convolution

  1. Checkerboard Artifacts
    • Occur due to uneven overlap when applying the transposed convolution.
    • Some pixels receive more updates than others, leading to an uneven pixel distribution.
    • Common in image generation (GANs), segmentation (U-Net), and super-resolution.
  2. Uneven Overlap of Convolution Kernels
    • When using large stride values, some pixels are affected more frequently than others.
    • This leads to distortions in the generated images.

How to Mitigate These Issues

1. Use Bilinear Upsampling Instead of Transpose Convolution

  • Apply bilinear interpolation first, then apply a regular convolution (Conv2D) to refine features.

import os
import logging
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, UpSampling2D, Conv2D

# Suppress TensorFlow warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Ignore INFO, WARNING, and ERROR messages
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'  # Turn off oneDNN custom operations
logging.getLogger('tensorflow').setLevel(logging.ERROR)

# Define the input layer
input_layer = Input(shape=(28, 28, 1))  # Example: grayscale input

# Apply bilinear upsampling followed by a convolution to prevent checkerboard artifacts
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(input_layer)  # Upsample by a factor of 2

output_layer = Conv2D(
    filters=64,
    kernel_size=(3, 3),
    padding='same',
    activation='relu'  # Add ReLU for feature extraction
)(x)

# Create the model
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# Print model summary
model.summary()

Transformers

  • Originally developed for natural language processing; now also used in image processing and time-series forecasting.
  • Introduced in "Attention Is All You Need" (Vaswani et al., 2017).
  • Self-attention mechanism enables parallel processing of input data.
  • Captures long-range dependencies more effectively than RNNs.
  • Forms the backbone of models like ViTs (Vision Transformers) and GPT.
[Figure: Transformer encoder-decoder architecture]
  • Encoder-Decoder Architecture
    • Encoder: Processes input sequences, extracting contextual embeddings.
    • Decoder: Generates output sequences, attending to encoder outputs.
  • Key Components
    • Self-Attention Layers: Assign importance to words/tokens, capturing dependencies.
    • Feedforward Layers: Transform input embeddings through dense layers.
    • Positional Encoding: Adds order awareness to input sequences.
    • Multi-Head Attention: Captures diverse contextual relationships.
  • Multiple Sub-Layers
    • Each encoder/decoder block contains self-attention + feedforward layers, enabling complex representations.
Self-attention component
  • Allows each input word to attend to all other words, capturing global context.
  • Uses Query (Q), Key (K), and Value (V) matrices for attention computation.
  • Steps:
    1. Compute attention scores: Dot product of Q and K.
    2. Scale scores: Divide by sqrt(d_k) to stabilize gradients.
    3. Apply Softmax: Converts scores into attention weights.
    4. Weight values (V) using attention scores to get the final representation.

[Figure: scaled dot-product attention — Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V]
  • Enables parallel processing and better long-range dependencies than RNNs.

Code example: Self-attention calculation

import tensorflow as tf
from tensorflow.keras.layers import Layer

class SelfAttention(Layer):
    def __init__(self, d_model):
        super(SelfAttention, self).__init__()
        self.d_model = d_model
        self.query_dense = tf.keras.layers.Dense(d_model)
        self.key_dense = tf.keras.layers.Dense(d_model)
        self.value_dense = tf.keras.layers.Dense(d_model)

    def call(self, inputs):
        # Project the inputs to queries, keys, and values
        q = self.query_dense(inputs)
        k = self.key_dense(inputs)
        v = self.value_dense(inputs)
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        attention_weights = tf.nn.softmax(
            tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(self.d_model, tf.float32)),
            axis=-1
        )
        output = tf.matmul(attention_weights, v)
        return output

# Example usage
inputs = tf.random.uniform((1, 60, 512))  # Batch size of 1, sequence length of 60, model dimension of 512
self_attention = SelfAttention(d_model=512)
output = self_attention(inputs)
print(output.shape)  # Should print (1, 60, 512)


Transformer Encoder

Composed of multiple stacked layers to process input sequences efficiently.

Key Components:

  • Self-Attention Mechanism → Enables each token to attend to all others, capturing dependencies.
  • Feedforward Neural Network → Applies transformations after self-attention for deeper feature extraction.
  • Residual Connections → Helps prevent vanishing gradients and stabilizes training.
  • Layer Normalization → Normalizes activations, improving convergence.
  • Positional Encoding → Adds sequence order information to input embeddings.

Process Flow:

  1. Input embedding → Converts tokens into vector representations.
  2. Positional Encoding → Injects position-related information.
  3. Self-Attention → Computes contextual relationships.
  4. Feedforward Layers → Applies learned transformations.
  5. Normalization & Residuals → Ensure stability and better gradient flow.

# Code example: Transformer encoder

import tensorflow as tf
from tensorflow.keras.layers import Layer, MultiHeadAttention, Dense, LayerNormalization, Dropout
import numpy as np

# Positional Encoding Layer
class PositionalEncoding(Layer):
    def __init__(self, max_length, d_model):
        super(PositionalEncoding, self).__init__()
        self.pos_encoding = self.positional_encoding(max_length, d_model)

    def positional_encoding(self, max_length, d_model):
        positions = np.arange(max_length)[:, np.newaxis]  # Shape: (max_length, 1)
        i = np.arange(d_model)[np.newaxis, :]  # Shape: (1, d_model)
        angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
        angle_rads = positions * angle_rates
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # Apply sin to even indices
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # Apply cos to odd indices
        return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # Add batch dimension

    def call(self, x):
        return x + self.pos_encoding[:, :tf.shape(x)[1], :]

# Transformer Encoder Layer
class TransformerEncoder(Layer):
    def __init__(self, d_model, num_heads, dff, max_length, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.pos_encoding = PositionalEncoding(max_length, d_model)  # Add positional encoding
        self.mha = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),  # Expand feature space
            Dense(d_model)  # Project back to original size
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, x, training, mask=None):
        x = self.pos_encoding(x)  # Add positional encoding
        attn_output = self.mha(x, x, x, attention_mask=mask)  # Self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual + Norm
        ffn_output = self.ffn(out1)  # Feedforward Network
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual + Norm
        return out2  # Output of Encoder Layer

# Example Usage
d_model = 512
num_heads = 8
dff = 2048
max_length = 60  # Max sequence length

encoder = TransformerEncoder(d_model=d_model, num_heads=num_heads, dff=dff, max_length=max_length)
x = tf.random.uniform((1, max_length, d_model))  # Batch of 1, 60 tokens, 512 dimensions
mask = None
output = encoder(x, training=True, mask=mask)
print(output.shape)  # Expected: (1, 60, 512)


Transformer decoder

Generates sequences based on context from the encoder.

Cross-Attention Mechanism → Attends to encoder outputs while generating target sequences.

Takes Target Sequence as Input → Uses previously generated tokens to predict the next.

Key Components:

  • Self-Attention → Allows each token to attend to previous tokens in the target sequence.
  • Cross-Attention → Attends to encoder outputs for context.
  • Feedforward Neural Network → Transforms attention outputs into meaningful representations.
  • Masked Self-Attention → Ensures predictions depend only on past tokens (prevents peeking).

Process Flow:

  1. Receives encoder output + target sequence.
  2. Self-attention captures relationships within the target sequence.
  3. Cross-attention aligns the generated sequence with encoder outputs.
  4. Feedforward layers refine the representation.
  5. Outputs probabilities for the next token in the sequence.
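The look_ahead_mask referenced in the code below is left as None for simplicity. A common way to build it (a minimal sketch, using Keras MultiHeadAttention's convention that 1 means "may attend"):

import tensorflow as tf

def create_look_ahead_mask(seq_len):
    # Lower-triangular matrix: position i may attend only to positions <= i
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

mask = create_look_ahead_mask(4)
print(mask)  # 4x4 lower-triangular matrix of ones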

import tensorflow as tf

from tensorflow.keras.layers import Layer, MultiHeadAttention, Dense, LayerNormalization, Dropout

# Transformer Decoder Layer

class TransformerDecoder(Layer):

    def __init__(self, d_model, num_heads, dff, rate=0.1):

        super(TransformerDecoder, self).__init__()

        # Self-attention for target sequence

        self.mha1 = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

        # Cross-attention with encoder's output

        self.mha2 = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

        # Feedforward network

        self.ffn = tf.keras.Sequential([

            Dense(dff, activation='relu'),  # Expand feature space

            Dense(d_model)  # Project back to original size

        ])

        # Layer normalization and dropout layers

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.layernorm3 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

        self.dropout3 = Dropout(rate)

    def call(self, x, encoder_output, training, look_ahead_mask=None, padding_mask=None):

        """

        x: Target sequence input

        encoder_output: Output from Transformer Encoder

        look_ahead_mask: Ensures decoder only attends to previous positions

        padding_mask: Masks padded positions in the input

        """

        # Self-attention with masking

        attn1 = self.mha1(x, x, x, attention_mask=look_ahead_mask)

        attn1 = self.dropout1(attn1, training=training)

        out1 = self.layernorm1(attn1 + x)  # Residual connection + normalization

        # Cross-attention with encoder output

        attn2 = self.mha2(out1, encoder_output, encoder_output, attention_mask=padding_mask)

        attn2 = self.dropout2(attn2, training=training)

        out2 = self.layernorm2(attn2 + out1)  # Residual connection + normalization

        # Feedforward network

        ffn_output = self.ffn(out2)

        ffn_output = self.dropout3(ffn_output, training=training)

        out3 = self.layernorm3(ffn_output + out2)  # Residual connection + normalization

        return out3  # Output of decoder layer

# Example usage

d_model = 512

num_heads = 8

dff = 2048

max_length = 60

decoder = TransformerDecoder(d_model=d_model, num_heads=num_heads, dff=dff)

x = tf.random.uniform((1, max_length, d_model))  # Batch of 1, 60 tokens, 512 dimensions

encoder_output = tf.random.uniform((1, max_length, d_model))  # Simulated encoder output

look_ahead_mask = None  # Typically generated dynamically

padding_mask = None

output = decoder(x, encoder_output, training=True, look_ahead_mask=look_ahead_mask, padding_mask=padding_mask)

print(output.shape)  # Expected: (1, 60, 512)
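Wiring the two halves together (a minimal sketch reusing the encoder and decoder instances created in the examples above; a complete model would add token embeddings and a final softmax projection layer):

src = tf.random.uniform((1, 60, 512))  # Embedded source sequence
tgt = tf.random.uniform((1, 60, 512))  # Embedded (shifted) target sequence

enc_out = encoder(src, training=False)           # TransformerEncoder from the previous section
dec_out = decoder(tgt, enc_out, training=False)  # TransformerDecoder defined above
print(dec_out.shape)  # (1, 60, 512)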


Transformers for sequential data

Defined by order and dependencies → Each element depends on previous elements.

Examples:

  • Natural Language Text → Words depend on context from previous words.
  • Time-Series Data → Stock prices, weather, and sensor data rely on past values.

Common Models for Sequential Data:

  • RNNs (Recurrent Neural Networks) → Handle short-term dependencies.
  • LSTMs & GRUs → Manage long-term dependencies via gating mechanisms.
  • Transformers → Use self-attention to capture dependencies without recurrence.

Handling of sequential data by Transformers:

Uses Self-Attention Mechanisms → Captures dependencies without recurrence.

Handles Long-Range Dependencies → Unlike RNNs/LSTMs, Transformers don't suffer from vanishing gradients.

Supports Efficient Parallelization → Processes entire sequences simultaneously, unlike sequential RNNs.

Applications:

  • Natural Language Processing (NLP) → Machine translation, text generation (GPT, BERT).
  • Time-Series Forecasting → Stock predictions, demand forecasting, anomaly detection.

Building the Transformer Encoder:


import tensorflow as tf

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout

class MultiHeadSelfAttention(Layer):

    def __init__(self, embed_dim, num_heads=8):

        super(MultiHeadSelfAttention, self).__init__()

        self.embed_dim = embed_dim

        self.num_heads = num_heads

        self.projection_dim = embed_dim // num_heads

        self.query_dense = Dense(embed_dim)

        self.key_dense = Dense(embed_dim)

        self.value_dense = Dense(embed_dim)

        self.combine_heads = Dense(embed_dim)


    def attention(self, query, key, value):

        score = tf.matmul(query, key, transpose_b=True)

        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)

        scaled_score = score / tf.math.sqrt(dim_key)

        weights = tf.nn.softmax(scaled_score, axis=-1)

        output = tf.matmul(weights, value)

        return output, weights

    def split_heads(self, x, batch_size):

        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))

        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):

        batch_size = tf.shape(inputs)[0]

        query = self.query_dense(inputs)

        key = self.key_dense(inputs)

        value = self.value_dense(inputs)

        query = self.split_heads(query, batch_size)

        key = self.split_heads(key, batch_size)

        value = self.split_heads(value, batch_size)

        attention, _ = self.attention(query, key, value)

        attention = tf.transpose(attention, perm=[0, 2, 1, 3])

        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))

        output = self.combine_heads(concat_attention)

        return output

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        super(TransformerBlock, self).__init__()

        self.att = MultiHeadSelfAttention(embed_dim, num_heads)

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),

            Dense(embed_dim),

        ])

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)


    def call(self, inputs, training):

        attn_output = self.att(inputs)

        attn_output = self.dropout1(attn_output, training=training)

        out1 = self.layernorm1(inputs + attn_output)

        ffn_output = self.ffn(out1)

        ffn_output = self.dropout2(ffn_output, training=training)

        return self.layernorm2(out1 + ffn_output)

class TransformerEncoder(Layer):

    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, rate=0.1):

        super(TransformerEncoder, self).__init__()

        self.num_layers = num_layers

        self.embed_dim = embed_dim

        self.enc_layers = [TransformerBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)]

        self.dropout = Dropout(rate)

    def call(self, inputs, training=False):

        x = inputs

        for i in range(self.num_layers):

            x = self.enc_layers[i](x, training=training)

        return x

# Example usage

embed_dim = 128

num_heads = 8

ff_dim = 512

num_layers = 4

transformer_encoder = TransformerEncoder(num_layers, embed_dim, num_heads, ff_dim)

inputs = tf.random.uniform((1, 100, embed_dim))

outputs = transformer_encoder(inputs, training=False)  # Use keyword argument for 'training'

print(outputs.shape)  # Should print (1, 100, 128)


Advanced Transformer Applications

Used in Computer Vision → Processes images without CNNs, leveraging self-attention.

Vision Transformers (ViTs)

  • Divides images into patches → Treats them as a sequence of tokens (like words in NLP).
  • Applies Transformer architecture → Captures long-range dependencies across patches.
  • Outperforms CNNs on large datasets (e.g., ImageNet).

Other Vision Transformer Variants

  • Swin Transformer → Uses hierarchical feature maps with shifted windows.
  • DETR (Detection Transformer) → Object detection using end-to-end attention.

import tensorflow as tf

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention, Conv2D, Reshape, Flatten

from tensorflow.keras.models import Model

# ==========================================

# 🔹 Patch Embedding Layer (Converts Images to Patches)

# ==========================================

class PatchEmbedding(Layer):

    def __init__(self, img_size, patch_size, embedding_dim):

        """

        Converts an image into a sequence of patches and embeds them.

        Args:

        - img_size (int): Size of the input image (assumes square images).

        - patch_size (int): Size of each square patch.

        - embedding_dim (int): Dimensionality of the patch embeddings.

        """

        super(PatchEmbedding, self).__init__()

        self.num_patches = (img_size // patch_size) ** 2  # Calculate number of patches

        self.embedding_dim = embedding_dim

        # Convolutional layer extracts patches and embeds them

        self.projection = Conv2D(filters=embedding_dim,

                                 kernel_size=patch_size,

                                 strides=patch_size,

                                 padding='valid')

        # Reshape patches into a sequence

        self.flatten = Reshape((self.num_patches, embedding_dim))

    def call(self, images):

        """

        Forward pass: Extract patches and embed them.

        Args:

        - images (Tensor): Shape (batch_size, img_size, img_size, channels).

        Returns:

        - Tensor of shape (batch_size, num_patches, embedding_dim).

        """

        patches = self.projection(images)  # Extract and embed patches

        return self.flatten(patches)  # Flatten to sequence

# ==========================================

# 🔹 Transformer Block (Self-Attention + Feedforward)

# ==========================================

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        """

        Transformer Encoder Block for Self-Attention and Feedforward Network.

        Args:

        - embed_dim (int): Dimension of the input embeddings.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Dimension of the feedforward network.

        - rate (float): Dropout rate.

        """

        super(TransformerBlock, self).__init__()

        # Multi-head self-attention mechanism

        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feedforward network (expansion + projection)

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),  # Expand feature space

            Dense(embed_dim),  # Project back to original size

        ])

        # Layer normalization and dropout layers

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

    def call(self, inputs, training, mask=None):

        """

        Forward pass for the Transformer block.

        Args:

        - inputs (Tensor): Input tensor of shape (batch_size, num_patches, embed_dim).

        - training (bool): Whether the model is in training mode.

        - mask (Tensor, optional): Attention mask.

        Returns:

        - Tensor with the same shape as input (batch_size, num_patches, embed_dim).

        """

        # Apply self-attention

        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)

        attn_output = self.dropout1(attn_output, training=training)

        out1 = self.layernorm1(inputs + attn_output)  # Residual connection + normalization

        # Apply feedforward network

        ffn_output = self.ffn(out1)

        ffn_output = self.dropout2(ffn_output, training=training)

        return self.layernorm2(out1 + ffn_output)  # Residual connection + normalization

# ==========================================

# 🔹 Patch Extraction Function

# ==========================================

def extract_patches(images):

    """

    Extracts non-overlapping patches from the input image.

    Args:

    - images (Tensor): Input image tensor of shape (batch_size, img_size, img_size, channels).

    Returns:

    - Tensor of shape (batch_size, num_patches, patch_size * patch_size * channels).

    """

    batch_size = tf.shape(images)[0]  # Get batch size dynamically

    # Extract fixed-size patches using TensorFlow's built-in function

    patches = tf.image.extract_patches(

        images=images,  # Input image tensor

        sizes=[1, 16, 16, 1],  # Patch size (16x16)

        strides=[1, 16, 16, 1],  # Move 16 pixels each step (non-overlapping patches)

        rates=[1, 1, 1, 1],  # No dilation (standard patches)

        padding='VALID'  # No padding applied

    )

    # Reshape patches into sequence format

    patches = tf.reshape(patches, [batch_size, -1, 16 * 16 * 3])

    return patches  # Returns extracted patches as a sequence

# ==========================================

# 🔹 Vision Transformer Model

# ==========================================

class VisionTransformer(Model):

    def __init__(self, img_size, patch_size, embedding_dim, num_heads, ff_dim, num_layers, num_classes):

        """

        Vision Transformer (ViT) model.

        Args:

        - img_size (int): Size of the input image (assumes square images).

        - patch_size (int): Size of each patch.

        - embedding_dim (int): Dimensionality of patch embeddings.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Feedforward layer dimension.

        - num_layers (int): Number of Transformer blocks.

        - num_classes (int): Number of output classes.

        """

        super(VisionTransformer, self).__init__()

        # Patch embedding layer (Converts images into patch embeddings)

        self.patch_embed = PatchEmbedding(img_size, patch_size, embedding_dim)

        # Stack multiple Transformer blocks

        self.transformer_layers = [TransformerBlock(embedding_dim, num_heads, ff_dim) for _ in range(num_layers)]

        # Classification head

        self.flatten = Flatten()

        self.dense = Dense(num_classes, activation='softmax')

    def call(self, images, training=False):

        """

        Forward pass for the Vision Transformer model.

        Args:

        - images (Tensor): Input image tensor of shape (batch_size, img_size, img_size, channels).

        - training (bool): Indicates whether the model is in training mode.

        Returns:

        - Tensor of shape (batch_size, num_classes) with class probabilities.

        """

        patches = self.patch_embed(images)  # Convert image to patches

        for transformer_layer in self.transformer_layers:

            patches = transformer_layer(patches, training=training)  # Apply Transformer blocks

        x = self.flatten(patches)  # Flatten to feed into classification head

        return self.dense(x)  # Output class probabilities

# ==========================================

# 🔹 Example Usage: Vision Transformer

# ==========================================

num_patches = 196  # Assuming 14x14 patches

embedding_dim = 128

num_heads = 4

ff_dim = 512

num_layers = 6

num_classes = 10  # For CIFAR-10 dataset

# Instantiate Vision Transformer model

vit = VisionTransformer(img_size=224, patch_size=16, embedding_dim=embedding_dim,

                        num_heads=num_heads, ff_dim=ff_dim, num_layers=num_layers, num_classes=num_classes)

# Generate a batch of random input images (batch_size=32, image_size=224x224, 3 color channels)

images = tf.random.uniform((32, 224, 224, 3))  # Batch of 32 images of size 224x224

# Forward pass through the Vision Transformer model

output = vit(images)

# Print the shape of the model output (should be (32, 10) for 10-class classification)

print(output.shape)  # Expected output: (32, 10)


Speech recognition

Converts Audio Signals into Spectrograms → Transforms raw waveforms into visual time-frequency representations.

Processes the Sequential Nature of Speech Data → Uses self-attention to model dependencies without recurrence (unlike RNNs).

🔹 Examples of Transformer-Based Speech Models

  • Wav2Vec → Learns speech representations from raw audio using unsupervised pretraining.
  • Speech Transformer → Adapts the Transformer architecture for end-to-end speech recognition.
  • Whisper (by OpenAI) → Multilingual speech recognition and transcription.
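Before the model below can be used, raw waveforms must first be converted to (log-Mel) spectrograms. A minimal sketch using TensorFlow's signal-processing ops (the frame, FFT, and Mel-bin sizes are illustrative assumptions):

import tensorflow as tf

waveform = tf.random.uniform((16000,))  # 1 second of audio at 16 kHz (placeholder)

# Short-time Fourier transform -> magnitude spectrogram
stft = tf.signal.stft(waveform, frame_length=400, frame_step=160, fft_length=512)
spectrogram = tf.abs(stft)  # Shape: (num_frames, 257)

# Project linear frequency bins onto 80 Mel bins
mel_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=80, num_spectrogram_bins=257, sample_rate=16000,
    lower_edge_hertz=20.0, upper_edge_hertz=8000.0)
mel_spectrogram = tf.matmul(spectrogram, mel_matrix)

log_mel = tf.math.log(mel_spectrogram + 1e-6)  # Log compression
print(log_mel.shape)  # (num_frames, 80), ready for a model like the SpeechTransformer below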

import tensorflow as tf

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention, Flatten, Conv1D, BatchNormalization, Reshape

from tensorflow.keras.models import Model

# ==========================================

# 🔹 Transformer Block (Self-Attention + Feedforward)

# ==========================================

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        """

        Transformer block consisting of Multi-Head Self-Attention and a Feedforward Network.

        Args:

        - embed_dim (int): Embedding dimension of input data.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Hidden layer size in the feedforward network.

        - rate (float): Dropout rate for regularization.

        """

        super(TransformerBlock, self).__init__()

        # Multi-Head Self-Attention

        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feedforward network (expansion + projection)

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),  # Expand feature space

            Dense(embed_dim)  # Project back to original size

        ])

        # Layer normalization and dropout layers

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

    def call(self, inputs, training, mask=None):

        """

        Forward pass through the Transformer block.

        Args:

        - inputs (Tensor): Input tensor of shape (batch_size, seq_length, embed_dim).

        - training (bool): Whether the model is in training mode.

        - mask (Tensor, optional): Mask for attention.

        Returns:

        - Tensor of shape (batch_size, seq_length, embed_dim).

        """

        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)  # Self-attention

        attn_output = self.dropout1(attn_output, training=training)

        out1 = self.layernorm1(inputs + attn_output)  # Residual connection + normalization

        ffn_output = self.ffn(out1)  # Feedforward network

        ffn_output = self.dropout2(ffn_output, training=training)

        return self.layernorm2(out1 + ffn_output)  # Residual connection + normalization

# ==========================================

# 🔹 Patch Embedding Layer (For Speech Data)

# ==========================================

class PatchEmbedding(Layer):

    def __init__(self, num_patches, embedding_dim):

        """

        Patch Embedding layer to project extracted spectrogram patches into embeddings.

        Args:

        - num_patches (int): Number of extracted patches.

        - embedding_dim (int): Embedding dimension for the transformer.

        """

        super(PatchEmbedding, self).__init__()

        self.num_patches = num_patches

        self.embedding_dim = embedding_dim

        self.projection = Dense(embedding_dim)  # Fully connected layer to project patches

    def call(self, patches):

        """

        Forward pass: Projects patches into embedding space.

        Args:

        - patches (Tensor): Shape (batch_size, num_patches, patch_dim).

        Returns:

        - Tensor of shape (batch_size, num_patches, embedding_dim).

        """

        return self.projection(patches)

# ==========================================

# 🔹 Speech Transformer Model

# ==========================================

class SpeechTransformer(Model):

    def __init__(self, num_mel_bins, embedding_dim, num_heads, ff_dim, num_layers, num_classes):

        """

        Speech Transformer model for audio classification tasks.

        Args:

        - num_mel_bins (int): Number of Mel bins in the spectrogram.

        - embedding_dim (int): Embedding size for Transformer.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Feedforward network dimension.

        - num_layers (int): Number of Transformer layers.

        - num_classes (int): Number of output classes.

        """

        super(SpeechTransformer, self).__init__()

        # Convolutional layer for feature extraction from spectrograms

        self.conv1 = Conv1D(filters=embedding_dim, kernel_size=3, strides=1, padding='same', activation='relu')

        # Batch normalization for stabilization

        self.batch_norm = BatchNormalization()

        # Reshape layer to transform spectrogram data into transformer-compatible format

        self.reshape = Reshape((-1, embedding_dim))

        # Stacking multiple Transformer layers

        self.transformer_layers = [TransformerBlock(embedding_dim, num_heads, ff_dim) for _ in range(num_layers)]

        # Flatten and classification head

        self.flatten = Flatten()

        self.dense = Dense(num_classes, activation='softmax')

    def call(self, spectrograms, training=False):

        """

        Forward pass for the Speech Transformer model.

        Args:

        - spectrograms (Tensor): Input tensor of shape (batch_size, time_steps, num_mel_bins).

        Returns:

        - Tensor of shape (batch_size, num_classes) with class probabilities.

        """

        x = self.conv1(spectrograms)  # Apply initial Conv1D for feature extraction

        x = self.batch_norm(x)  # Normalize features

        x = self.reshape(x)  # Reshape to sequence for Transformer processing

        for transformer_layer in self.transformer_layers:

            x = transformer_layer(x, training=training)  # Pass through Transformer layers

        x = self.flatten(x)  # Flatten output

        return self.dense(x)  # Apply classification head

# ==========================================

# 🔹 Example Usage: Speech Transformer

# ==========================================

num_mel_bins = 80  # Number of Mel-frequency bins in spectrogram

embedding_dim = 128  # Embedding dimension for Transformer

num_heads = 4  # Number of self-attention heads

ff_dim = 512  # Dimension of feedforward network

num_layers = 6  # Number of Transformer layers

num_classes = 30  # Example: Phoneme classification (30 classes)

# Instantiate Speech Transformer model

st = SpeechTransformer(num_mel_bins, embedding_dim, num_heads, ff_dim, num_layers, num_classes)

# Generate example batch of spectrograms (batch_size=32, time_steps=100, mel_bins=80)

spectrograms = tf.random.uniform((32, 100, num_mel_bins))  # Batch of 32 spectrograms, 100 time frames

# Get model predictions

output = st(spectrograms, training=True)

# Print the shape of the model output (should be (32, 30) for batch size 32 and 30 classes)

print(output.shape)  # Expected output: (32, 30)


Reinforcement Learning (RL) with Transformers

Models complex dependencies in sequences of states and actions.

Optimizes decision-making through reward-based learning.

🔹 Decision Transformers (DTs)

  • Leverage Transformer architecture to predict optimal actions.
  • Use past trajectories (state-action-reward history) for decision-making.
  • Apply self-attention to capture long-term dependencies.
  • Unlike traditional RL (which optimizes a policy), DTs treat RL as a sequence modeling problem.
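The "reward" part of the trajectory is typically encoded as a return-to-go (the sum of rewards from the current step onward). The simplified model sketch below embeds only states and actions; computing returns-to-go for full DT-style conditioning looks like this (a minimal sketch):

import numpy as np

def returns_to_go(rewards):
    # R_t = sum of rewards from step t to the end of the trajectory
    return np.cumsum(rewards[::-1])[::-1]

print(returns_to_go(np.array([1.0, 0.0, 2.0, 1.0])))  # [4. 3. 3. 1.]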

🔹 Applications of RL with Transformers

  • Game AI (AlphaStar, OpenAI Five).
  • Robotics (decision-making in real-world tasks).
  • Autonomous Vehicles (route planning and navigation).
  • Trading & Finance (portfolio optimization).

import tensorflow as tf

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention, TimeDistributed

from tensorflow.keras.models import Model

# ==========================================

# 🔹 Transformer Block (Self-Attention + Feedforward)

# ==========================================

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        """

        Transformer block consisting of Multi-Head Self-Attention and a Feedforward Network.

        Args:

        - embed_dim (int): Embedding dimension of input data.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Hidden layer size in the feedforward network.

        - rate (float): Dropout rate for regularization.

        """

        super(TransformerBlock, self).__init__()

        # Multi-Head Self-Attention

        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feedforward network (expansion + projection)

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),  # Expand feature space

            Dense(embed_dim)  # Project back to original size

        ])

        # Layer normalization and dropout layers

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

    def call(self, inputs, training, mask=None):

        """

        Forward pass through the Transformer block.

        Args:

        - inputs (Tensor): Input tensor of shape (batch_size, seq_length, embed_dim).

        - training (bool): Whether the model is in training mode.

        - mask (Tensor, optional): Mask for attention.

        Returns:

        - Tensor of shape (batch_size, seq_length, embed_dim).

        """

        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)  # Self-attention

        attn_output = self.dropout1(attn_output, training=training)

        out1 = self.layernorm1(inputs + attn_output)  # Residual connection + normalization

        ffn_output = self.ffn(out1)  # Feedforward network

        ffn_output = self.dropout2(ffn_output, training=training)

        return self.layernorm2(out1 + ffn_output)  # Residual connection + normalization

# ==========================================

# 🔹 Decision Transformer Model for RL

# ==========================================

class DecisionTransformer(Model):

    def __init__(self, state_dim, action_dim, embedding_dim, num_heads, ff_dim, num_layers):

        """

        Decision Transformer model for Reinforcement Learning.

        Args:

        - state_dim (int): Dimension of state input.

        - action_dim (int): Dimension of action output.

        - embedding_dim (int): Embedding size for Transformer.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Feedforward network dimension.

        - num_layers (int): Number of Transformer layers.

        """

        super(DecisionTransformer, self).__init__()

        # State and Action Embeddings

        self.state_embed = Dense(embedding_dim, activation='relu')

        self.action_embed = Dense(embedding_dim, activation='relu')

        # Stacking multiple Transformer layers

        self.transformer_layers = [TransformerBlock(embedding_dim, num_heads, ff_dim) for _ in range(num_layers)]

        # Output layer (TimeDistributed to maintain sequence output shape)

        self.dense = TimeDistributed(Dense(action_dim))

    def call(self, states, actions, training=False):

        """

        Forward pass for the Decision Transformer.

        Args:

        - states (Tensor): Input tensor of shape (batch_size, seq_length, state_dim).

        - actions (Tensor): Input tensor of shape (batch_size, seq_length, action_dim).

        Returns:

        - Tensor of shape (batch_size, seq_length, action_dim) with predicted actions.

        """

        # Embed states and actions

        state_embeddings = self.state_embed(states)

        action_embeddings = self.action_embed(actions)

        # Combine state and action embeddings

        x = state_embeddings + action_embeddings

        # Pass through Transformer layers

        for transformer_layer in self.transformer_layers:

            x = transformer_layer(x, training=training)

        return self.dense(x)  # Predict next action probabilities

# ==========================================

# 🔹 Example Usage: Decision Transformer

# ==========================================

# Define model hyperparameters

state_dim = 20  # State representation size

action_dim = 5  # Action space size

embedding_dim = 128  # Embedding dimension

num_heads = 4  # Number of self-attention heads

ff_dim = 512  # Feedforward network size

num_layers = 6  # Number of Transformer layers

# Instantiate Decision Transformer model

dt = DecisionTransformer(state_dim, action_dim, embedding_dim, num_heads, ff_dim, num_layers)

# Generate example batch of states and actions (batch_size=32, seq_length=100)

states = tf.random.uniform((32, 100, state_dim))  # Batch of 32 sequences of 100 states

actions = tf.random.uniform((32, 100, action_dim))  # Corresponding action sequences

# Get model predictions

output = dt(states, actions, training=True)

# Print the shape of the model output (should be (32, 100, 5) for batch size 32, sequence length 100, and action dimension 5)

print(output.shape)  # Expected output: (32, 100, 5)


Advantages of Transformers for Time-Series Prediction

Capture Long-Range Dependencies → The self-attention mechanism allows models to learn patterns across long time horizons, unlike RNNs/LSTMs which struggle with vanishing gradients.

Transformer for time series prediction.png

Enable Parallelization → Unlike sequential models (RNNs, LSTMs), Transformers process entire sequences simultaneously, leading to faster training and inference.

Handle Variable-Length Sequences → No fixed input size requirement, making Transformers flexible for irregular or dynamic time-series data.

Robust to Missing Data → Self-attention helps focus on available information rather than relying on strictly ordered sequences, making Transformers resilient to gaps in data.


import tensorflow as tf

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention

# ==========================================

# 🔹 Transformer Block (Self-Attention + Feedforward)

# ==========================================

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        """

        Transformer block consisting of Multi-Head Self-Attention and a Feedforward Network.

        Args:

        - embed_dim (int): Embedding dimension of input data.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Hidden layer size in the feedforward network.

        - rate (float): Dropout rate for regularization.

        """

        super(TransformerBlock, self).__init__()

        # Multi-Head Self-Attention layer

        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feedforward network (expansion + projection)

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),  # Expands feature space

            Dense(embed_dim)  # Projects back to original embedding size

        ])

        # Layer normalization for stabilizing training

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        # Dropout layers for regularization

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

    def call(self, inputs, training, mask=None):

        """

        Forward pass through the Transformer block.

        Args:

        - inputs (Tensor): Input tensor of shape (batch_size, seq_length, embed_dim).

        - training (bool): Whether the model is in training mode.

        - mask (Tensor, optional): Mask for self-attention.

        Returns:

        - Tensor of shape (batch_size, seq_length, embed_dim).

        """

        # Apply Multi-Head Self-Attention

        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)

        attn_output = self.dropout1(attn_output, training=training)

        # Residual connection + Layer Normalization

        out1 = self.layernorm1(inputs + attn_output)

        # Apply Feedforward Network

        ffn_output = self.ffn(out1)

        ffn_output = self.dropout2(ffn_output, training=training)

        # Residual connection + Layer Normalization

        return self.layernorm2(out1 + ffn_output)


Preparing Time-Series Data for Transformers

Normalize the Data → Scale stock prices (or any time-series data) to a range (e.g., [0,1] or [-1,1]) for stable training.

Create Sequences of Data Points as Input → Convert raw time-series data into fixed-length input sequences.

Use a Specified Number of Time Steps (T) → Each sequence consists of T previous time steps as input.

Label Each Sequence with the Next Value in the Series → The target (label) for each sequence is the next actual value in the time series.


import numpy as np

import pandas as pd

from sklearn.preprocessing import MinMaxScaler

# ==========================================

# 🔹 Load and Normalize the Stock Price Data

# ==========================================

# Load the dataset (ensure it contains a 'Close' column for stock prices)

data = pd.read_csv('/content/stock_prices.csv')

# Select the 'Close' column as a 2-D array (required by MinMaxScaler)

data = data[['Close']].values

# Normalize the data using Min-Max Scaling (scales values between 0 and 1)

scaler = MinMaxScaler(feature_range=(0, 1))

data = scaler.fit_transform(data)

# ==========================================

# 🔹 Function to Prepare Data for Training

# ==========================================

def create_dataset(data, time_step=1):

    """

    Prepares the dataset by creating input sequences (X) and corresponding labels (Y).

    Args:

    - data (numpy array): Normalized time-series data.

    - time_step (int): Number of time steps to use as input.

    Returns:

    - X (numpy array): Input sequences.

    - Y (numpy array): Target values (next step in sequence).

    """

    X, Y = [], []

    for i in range(len(data) - time_step - 1):

        a = data[i:(i + time_step), 0]  # Slice a sequence from data

        X.append(a)  # Append the sequence to X

        Y.append(data[i + time_step, 0])  # Append the next value to Y

    return np.array(X), np.array(Y)

# Define number of time steps (how many past days to use for predicting the next day)

time_step = 60  # Using past 60 days of data

# Generate training sequences and labels

X, Y = create_dataset(data, time_step)

# ==========================================

# 🔹 Debugging: Print Dataset Shapes

# ==========================================

# Print dataset information for debugging

print("Length of data:", len(data))

print("Length of X:", len(X))

print("Shape of first element in X:", X[0].shape if len(X) > 0 else "X is empty")

print("Shape of Y:", Y.shape)

# ==========================================

# 🔹 Reshape Data for Model Compatibility

# ==========================================

# Ensure X has the correct shape for LSTMs/Transformers (samples, time_steps, features)

if len(X) > 0:

    X = X.reshape(X.shape[0], X.shape[1], 1)  # Adding feature dimension

# Final shape checks

print("Shape of X after reshape:", X.shape)  # Expected: (num_samples, time_steps, 1)

print("Shape of Y:", Y.shape)  # Expected: (num_samples,)


Steps to Build and Train a Transformer Model for Time-Series Forecasting

1️⃣ Define an Embedding Layer

  • Convert input time-series sequences into dense vector representations using a Dense layer.
  • Helps the model capture meaningful features from raw time-series data.

2️⃣ Stack Multiple Transformer Blocks

  • Use self-attention mechanisms to capture long-range dependencies.
  • Each block contains:
    • Multi-Head Self-Attention (learns dependencies between time steps).
    • Feedforward Network (processes extracted features).
    • Residual Connections & Layer Normalization (improves stability).

3️⃣ Add a Final Dense Layer for Prediction

  • Flatten the output from Transformer layers.
  • Use a Dense layer with one neuron to predict the next value in the time series.

4️⃣ Compile the Model

  • Use the Adam optimizer for efficient gradient updates.
  • Set Mean Squared Error (MSE) as the loss function for continuous value prediction.

5️⃣ Train the Model

  • Feed prepared time-series sequences as input.
  • Optimize the model using gradient descent to minimize prediction error.
  • Evaluate performance using Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).

import numpy as np

import pandas as pd

import tensorflow as tf

import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention, Input, Flatten

from tensorflow.keras.models import Model

# ==========================================

# 🔹 Load and Preprocess Time-Series Data

# ==========================================

# Load stock price dataset (assuming it has a 'Close' column)

data = pd.read_csv('/content/stock_prices.csv')

# Extract closing prices and convert to NumPy array

data = data[['Close']].values

# Normalize data to range [0,1] for stable training

scaler = MinMaxScaler(feature_range=(0, 1))

data = scaler.fit_transform(data)

# ==========================================

# 🔹 Prepare Sequences for Training

# ==========================================

def create_dataset(data, time_step=60):

    """

    Generates sequences for time-series forecasting.

    Args:

    - data (numpy array): Normalized time-series data.

    - time_step (int): Number of past steps used for prediction.

    Returns:

    - X (numpy array): Input sequences.

    - Y (numpy array): Next-step predictions.

    """

    X, Y = [], []

    for i in range(len(data) - time_step - 1):

        X.append(data[i:(i + time_step), 0])  # Sequence of 'time_step' past values

        Y.append(data[i + time_step, 0])  # The next value in the series

    return np.array(X), np.array(Y)

# Define the time step (lookback period)

time_step = 60

# Create training sequences

X, Y = create_dataset(data, time_step)

# Debugging: Print dataset shapes

print(f"Total data points: {len(data)}")

print(f"Input shape (X): {X.shape}")

print(f"Target shape (Y): {Y.shape}")

# Reshape X for Transformer model (samples, time_steps, features)

X = X.reshape(X.shape[0], X.shape[1], 1)

# ==========================================

# 🔹 Define Transformer Block

# ==========================================

class TransformerBlock(Layer):

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):

        """

        Transformer block with Multi-Head Self-Attention and a Feedforward Network.

        Args:

        - embed_dim (int): Embedding dimension.

        - num_heads (int): Number of attention heads.

        - ff_dim (int): Size of the feedforward network.

        - rate (float): Dropout rate.

        """

        super(TransformerBlock, self).__init__()

        # Multi-Head Self-Attention layer

        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feedforward network

        self.ffn = tf.keras.Sequential([

            Dense(ff_dim, activation="relu"),  # Expands feature space

            Dense(embed_dim)  # Projects back to embedding size

        ])

        # Layer normalization and dropout layers

        self.layernorm1 = LayerNormalization(epsilon=1e-6)

        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(rate)

        self.dropout2 = Dropout(rate)

    def call(self, inputs, training, mask=None):

        """

        Forward pass of the Transformer block.

        Args:

        - inputs (Tensor): Input tensor (batch_size, time_steps, embed_dim).

        - training (bool): Whether the model is in training mode.

        - mask (Tensor, optional): Mask for self-attention.

        Returns:

        - Tensor of shape (batch_size, time_steps, embed_dim).

        """

        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)

        attn_output = self.dropout1(attn_output, training=training)

        out1 = self.layernorm1(inputs + attn_output)  # Residual connection + normalization

        ffn_output = self.ffn(out1)

        ffn_output = self.dropout2(ffn_output, training=training)

        return self.layernorm2(out1 + ffn_output)  # Residual connection + normalization

# ==========================================

# 🔹 Build the Transformer Model

# ==========================================

def build_transformer_model(time_steps, embed_dim, num_heads, ff_dim, num_layers):

    """

    Builds a Transformer model for time-series forecasting.

    Args:

    - time_steps (int): Number of past time steps used as input.

    - embed_dim (int): Embedding dimension.

    - num_heads (int): Number of attention heads.

    - ff_dim (int): Size of feedforward network.

    - num_layers (int): Number of Transformer layers.

    Returns:

    - Compiled Keras model.

    """

    inputs = Input(shape=(time_steps, 1))  # Input shape: (batch_size, time_steps, features)

    # Convert inputs into dense vector embeddings

    x = Dense(embed_dim)(inputs)

    # Stack multiple Transformer blocks

    for _ in range(num_layers):

        x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)

    # Flatten and output a single prediction

    x = Flatten()(x)

    outputs = Dense(1)(x)  # Predict next time step

    # Define model

    model = Model(inputs=inputs, outputs=outputs)

    # Compile the model using Adam optimizer and Mean Squared Error loss

    model.compile(optimizer='adam', loss='mse', metrics=['mae'])

    return model

# Define model parameters

embed_dim = 128

num_heads = 4

ff_dim = 512

num_layers = 4

# Build the Transformer model

model = build_transformer_model(time_step, embed_dim, num_heads, ff_dim, num_layers)

# ==========================================

# 🔹 Train the Transformer Model

# ==========================================

# Train model for 20 epochs with batch size 32

model.fit(X, Y, epochs=20, batch_size=32)

# ==========================================

# 🔹 Evaluate the Model and Make Predictions

# ==========================================

# Generate predictions

predictions = model.predict(X)

# Convert predictions back to original scale

predictions = scaler.inverse_transform(predictions)

# ==========================================

# 🔹 Plot the Predictions

# ==========================================

plt.figure(figsize=(10, 5))

plt.plot(scaler.inverse_transform(data), label='True Data')  # Original data

plt.plot(np.arange(time_step, time_step + len(predictions)), predictions, label='Predictions')

plt.xlabel('Time')

plt.ylabel('Stock Prices')

plt.legend()

plt.show()


📌 Explanation of Sequential Data Layers in TensorFlow

1️⃣ Recurrent Neural Networks (RNNs)

  • Process sequences one step at a time, maintaining a hidden state to retain memory.
  • Struggle with long-term dependencies due to vanishing gradients.
  • Suitable for short sequences (e.g., small text processing tasks).

2️⃣ Long Short-Term Memory Networks (LSTMs)

  • An improved version of RNNs with gates (forget, input, and output) to control memory flow.
  • Overcomes vanishing gradient issues, making them effective for long-term dependencies.
  • Used in speech recognition, text generation, and time-series forecasting.

3️⃣ Gated Recurrent Units (GRUs)

  • Similar to LSTMs but with fewer parameters (faster training).
  • Uses update and reset gates to manage memory efficiently.
  • Performs well on smaller datasets and real-time applications.

4️⃣ Convolutional Layers for Sequence Data (Conv1D)

  • Apply 1D convolution filters to detect patterns in sequential data.
  • Suitable for time-series classification and NLP tasks where spatial relationships matter.
  • Faster than RNNs but less effective for long-range dependencies.
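A minimal sketch comparing these four layer types in Keras (shapes and hyperparameters are illustrative):

import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, LSTM, GRU, Conv1D

x = tf.random.uniform((8, 50, 16))  # (batch, time_steps, features)

print(SimpleRNN(32)(x).shape)                              # (8, 32) - last hidden state only
print(LSTM(32, return_sequences=True)(x).shape)            # (8, 50, 32) - output at every step
print(GRU(32)(x).shape)                                    # (8, 32)
print(Conv1D(32, kernel_size=3, padding='same')(x).shape)  # (8, 50, 32)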

🚀 Most Used Layer Today: Transformers

While LSTMs and GRUs were dominant, Transformers have become the most widely used architecture for sequential data processing.

Advantages of Transformers Over RNNs/LSTMs:

  • Parallelization → Processes entire sequences at once (faster training).
  • Self-Attention Mechanism → Captures long-range dependencies more effectively.
  • Scalability → Used in GPT, BERT, Vision Transformers (ViTs), and Decision Transformers for RL.

Where Transformers Are Used:

  • Time-series forecasting (Temporal Fusion Transformers, Informer).
  • Natural Language Processing (NLP) (BERT, GPT, T5).
  • Speech recognition (Whisper, Wav2Vec).

Text Processing in TensorFlow

from tensorflow.keras.layers import TextVectorization

# ==========================================

# 🔹 Sample Text Data

# ==========================================

texts = [

    "Hello, how are you?",

    "I am fine, thank you.",

    "How about you?",

    "I am good too."

]

# ==========================================

# 🔹 Define the TextVectorization Layer

# ==========================================

vectorizer = TextVectorization(

    output_mode='int',         # Converts text into integer token sequences

    max_tokens=100,            # Limits vocabulary size to 100 unique tokens

    output_sequence_length=10  # Pads or truncates sequences to length 10

)

# ==========================================

# 🔹 Adapt the Vectorizer to the Text Data

# ==========================================

# This step builds the vocabulary based on the provided text corpus

vectorizer.adapt(texts)

# ==========================================

# 🔹 Vectorize the Text Data

# ==========================================

text_vectorized = vectorizer(texts)

# ==========================================

# 🔹 Print the Vectorized Output

# ==========================================

print("Vectorized text data:\n", text_vectorized.numpy())


Keras - Unsupervised learning

  • Clustering
    • Groups similar data points into clusters.
    • Useful for pattern recognition, customer segmentation, anomaly detection.
    • Examples: k-means clustering, hierarchical clustering, DBSCAN.
  • Association
    • Identifies relationships between variables in large datasets.
    • Common in market basket analysis to find product purchase patterns.
    • Examples: Apriori algorithm, Eclat algorithm, FP-Growth.
  • Dimensionality Reduction
    • Reduces the number of features while retaining essential information.
    • Used to improve computational efficiency and visualization.
    • Examples:
      • Principal Component Analysis (PCA) – Projects data onto principal components to maximize variance.
      • t-Distributed Stochastic Neighbor Embedding (t-SNE) – Preserves local similarities for high-dimensional data visualization.
      • Autoencoders – Neural networks that learn efficient data representations.
  • Anomaly Detection
    • Identifies unusual or rare data points that deviate significantly from normal patterns.
    • Used for fraud detection, network security, medical diagnosis, and predictive maintenance.
    • Examples:
      • Isolation Forest – Detects anomalies by recursively partitioning data and isolating outliers.
      • One-Class SVM – Learns a boundary around normal data points and classifies outliers.
      • Autoencoders – Trained to reconstruct normal data, anomalies show high reconstruction error.
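A minimal scikit-learn sketch of three of these techniques on toy data (the dataset and parameters are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X = np.random.rand(200, 10)  # Toy dataset: 200 samples, 10 features

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # Clustering
X_2d = PCA(n_components=2).fit_transform(X)                              # Dimensionality reduction
outliers = IsolationForest(random_state=0).fit_predict(X)                # Anomaly detection (-1 = outlier)

print(labels[:10], X_2d.shape, (outliers == -1).sum())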

Autoencoders (AEs)

A type of neural network used for dimensionality reduction and feature learning by encoding data into a lower-dimensional latent space and reconstructing it.

Autoencoder2.png

Architecture:

  1. Encoder: Compresses input into a smaller latent space representation (bottleneck - the compressed representation that contains the most important features).
  2. Decoder: Reconstructs input from the latent compressed representation.
  3. Loss Function: Typically Mean Squared Error (MSE) or Binary Cross-Entropy (BCE) to minimize reconstruction error.

Variants of Autoencoders:

  • Basic Autoencoders: A hidden layer in both encoder and decoder.
  • Denoising Autoencoders (DAE): Trained to remove noise from corrupted inputs.
  • Sparse Autoencoders: Uses sparsity constraints to learn key features.
  • Variational Autoencoders (VAE): Uses probabilistic latent variables for generative modeling.
  • Convolutional Autoencoders (CAE): Designed for image data using convolutional layers.

Applications:

  • Anomaly detection (fraud, medical images).
  • Data compression.
  • Feature extraction for downstream ML tasks.
  • Image denoising and super-resolution.

Simple Autoencoder in Keras:

import tensorflow as tf

from tensorflow.keras.layers import Input, Dense

from tensorflow.keras.models import Model

# Define the encoder

input_layer = Input(shape=(784,))

encoded = Dense(64, activation='relu')(input_layer)

# Bottleneck (compressed latent representation)

bottleneck = Dense(32, activation='relu')(encoded)

# Define the decoder

decoded = Dense(64, activation='relu')(bottleneck)

output_layer = Dense(784, activation='sigmoid')(decoded)

# Combine the encoder and decoder into an autoencoder model

autoencoder = Model(input_layer, output_layer)

# Compile the autoencoder

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Summary of the model

autoencoder.summary()


Training the autoencoder

import numpy as np

# Load the MNIST dataset

(x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()

# Normalize the data

x_train = x_train.astype('float32') / 255.

x_test = x_test.astype('float32') / 255.

x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))

x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

# Train the autoencoder

autoencoder.fit(x_train, x_train,

                epochs=50,

                batch_size=256,

                shuffle=True,

                validation_data=(x_test, x_test))

# Make the last four layers of the autoencoder trainable (fine-tuning example)

for layer in autoencoder.layers[-4:]:

    layer.trainable = True

# Compile the model again

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train the model again

autoencoder.fit(x_train, x_train, epochs=10, batch_size=256, shuffle=True, validation_data=(x_test, x_test))


Generative Adversarial Networks (GANs)

GANs consist of two neural networks in competition, used for generating realistic synthetic data.

Components:

  1. Generator (G)
    • Learns to create realistic data from random noise.
    • Maps input noise to a data distribution (e.g., images, text).
  2. Discriminator (D)
    • Evaluates if data is real (from training set) or fake (from Generator).
    • Acts as a binary classifier (real vs. fake).

Adversarial Process:

  • The Generator tries to fool the Discriminator by generating increasingly realistic data.
  • The Discriminator improves its ability to distinguish real from fake data.
  • Over time, the Generator produces data indistinguishable from real samples.

Loss Function:

  • Minimax Game:
    • Generator minimizes Discriminator’s ability to differentiate real vs. fake.
    • Discriminator maximizes its accuracy.
  • Typically, Binary Cross-Entropy Loss is used.
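In equation form, the standard GAN objective is the minimax value function

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 − D(G(z)))]

where the Discriminator maximizes V and the Generator minimizes it.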

Variants of GANs:

  • DCGAN – Uses convolutional layers for image generation.
  • CGAN – Conditional GAN, generates data based on labels.
  • WGAN – Uses Wasserstein distance for stable training.
  • StyleGAN – Generates high-quality, realistic images.

Applications:

  • Image Generation: Deepfake technology, artistic creation.
  • Data Augmentation: Generating training samples for ML models.
  • Super-Resolution: Enhancing low-resolution images.
  • Anomaly Detection: Detecting fraudulent transactions, medical anomalies.

import numpy as np

import tensorflow as tf

from tensorflow.keras.layers import Dense, Input, LeakyReLU

from tensorflow.keras.models import Model

# Define the generator model

def build_generator():

    model = tf.keras.Sequential()

    model.add(Dense(128, input_dim=100))

    model.add(LeakyReLU(alpha=0.01))

    model.add(Dense(784, activation='tanh'))

    return model

# Define the discriminator model

def build_discriminator():

    model = tf.keras.Sequential()

    model.add(Dense(128, input_shape=(784,)))

    model.add(LeakyReLU(alpha=0.01))

    model.add(Dense(1, activation='sigmoid'))

    return model

# Build and compile the discriminator

discriminator = build_discriminator()

discriminator.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Build the generator

generator = build_generator()

# Create the GAN by combining the generator and discriminator

discriminator.trainable = False

gan_input = Input(shape=(100,))

generated_image = generator(gan_input)

gan_output = discriminator(generated_image)

gan = Model(gan_input, gan_output)

# Compile the GAN

gan.compile(optimizer='adam', loss='binary_crossentropy')

def train_gan(gan, generator, discriminator, x_train, epochs=400, batch_size=128):

    # Loop through epochs
    for epoch in range(epochs):

        # Generate random noise as input for the generator
        noise = np.random.normal(0, 1, (batch_size, 100))
        generated_images = generator.predict(noise)

        # Get a random set of real images
        idx = np.random.randint(0, x_train.shape[0], batch_size)
        real_images = x_train[idx]

        # Labels for real and fake images
        real_labels = np.ones((batch_size, 1))
        fake_labels = np.zeros((batch_size, 1))

        # Train the discriminator on real and fake images separately
        d_loss_real = discriminator.train_on_batch(real_images, real_labels)
        d_loss_fake = discriminator.train_on_batch(generated_images, fake_labels)

        # Calculate the average loss for the discriminator
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

        # Generate new noise and train the generator through the GAN model
        # (the generator is trained via the GAN model, where the discriminator's weights are frozen)
        noise = np.random.normal(0, 1, (batch_size, 100))
        g_loss = gan.train_on_batch(noise, real_labels, return_dict=True)

        # Print the progress every 10 epochs
        if epoch % 10 == 0:
            print(f"Epoch {epoch} - Discriminator Loss: {d_loss[0]}, Generator Loss: {g_loss['loss']}")


Basic GAN Implementation:

import tensorflow as tf

from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization, Reshape, Flatten, Input

from tensorflow.keras.models import Model, Sequential

# Define the generator model

def build_generator():

    model = Sequential()

    model.add(Dense(256, input_dim=100))

    model.add(LeakyReLU(alpha=0.2))

    model.add(BatchNormalization(momentum=0.8))

    model.add(Dense(512))

    model.add(LeakyReLU(alpha=0.2))

    model.add(BatchNormalization(momentum=0.8))

    model.add(Dense(1024))

    model.add(LeakyReLU(alpha=0.2))

    model.add(BatchNormalization(momentum=0.8))

    model.add(Dense(28 * 28 * 1, activation='tanh'))

    model.add(Reshape((28, 28, 1)))

    return model

# Define the discriminator model

def build_discriminator():

    model = Sequential()

    model.add(Flatten(input_shape=(28, 28, 1)))

    model.add(Dense(512))

    model.add(LeakyReLU(alpha=0.2))

    model.add(Dense(256))

    model.add(LeakyReLU(alpha=0.2))

    model.add(Dense(1, activation='sigmoid'))

    return model

# Build and compile the discriminator

discriminator = build_discriminator()

discriminator.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Build the generator

generator = build_generator()

# Create the GAN by stacking the generator and the discriminator

gan_input = Input(shape=(100,))

generated_image = generator(gan_input)

discriminator.trainable = False

gan_output = discriminator(generated_image)

gan = Model(gan_input, gan_output)

gan.compile(loss='binary_crossentropy', optimizer='adam')


Train the GAN:


import numpy as np

from tensorflow.keras.datasets import mnist

# Load and preprocess the MNIST dataset

(x_train, _), (_, _) = mnist.load_data()

x_train = x_train / 127.5 - 1.

x_train = np.expand_dims(x_train, axis=-1)

# Training parameters

batch_size = 64

epochs = 10000

sample_interval = 1000

# Adversarial ground truths

real = np.ones((batch_size, 1))

fake = np.zeros((batch_size, 1))

# Training loop

for epoch in range(epochs):

    # Train the discriminator

    idx = np.random.randint(0, x_train.shape[0], batch_size)

    real_images = x_train[idx]

   

    noise = np.random.normal(0, 1, (batch_size, 100))

    generated_images = generator.predict(noise)

   

    d_loss_real = discriminator.train_on_batch(real_images, real)

    d_loss_fake = discriminator.train_on_batch(generated_images, fake)

   

    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

    # Train the generator
    noise = np.random.normal(0, 1, (batch_size, 100))
    g_loss = gan.train_on_batch(noise, real)

    # Print the progress
    if epoch % sample_interval == 0:
        print(f"{epoch} [D loss: {d_loss[0]}] [D accuracy: {100 * d_loss[1]}] [G loss: {g_loss}]")


Evaluating the GAN


import matplotlib.pyplot as plt

def sample_images(epoch):

    noise = np.random.normal(0, 1, (25, 100))

    gen_images = generator.predict(noise)

    gen_images = 0.5 * gen_images + 0.5

    fig, axs = plt.subplots(5, 5, figsize=(10, 10))

    count = 0

    for i in range(5):

        for j in range(5):

            axs[i, j].imshow(gen_images[count, :, :, 0], cmap='gray')

            axs[i, j].axis('off')

            count += 1

    plt.show()

# Sample images at the end of training

sample_images(epochs)


Diffusion Models

Diffusion models are a class of generative models used for image generation, denoising, and data augmentation by progressively removing noise from randomly sampled data.


Key Concepts:

  1. Image Generation:
    • Generates high-quality images from random noise.
    • Similar to GANs and VAEs, but offers more stable training.
  2. Image Denoising:
    • Removes noise from images by learning a reverse diffusion process.
    • Useful for enhancing low-quality images.
  3. Data Augmentation:
    • Generates synthetic training samples to improve model robustness.
    • Helps in low-data or imbalanced dataset scenarios.

How Diffusion Models Work:

  1. Forward Diffusion Process (Noise Addition):
    • Gradually adds Gaussian noise to training images over T time steps.
    • Converts images into pure noise.
  2. Reverse Diffusion Process (Denoising):
    • Trains a neural network (U-Net, Transformer) to reverse the noise step by step.
    • Recovers the original image structure.
  3. Image Generation:
    • Starts from random noise and progressively denoises it into a realistic image.
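
For concreteness, in the DDPM formulation (Ho et al., 2020) the forward noising step and its closed form are:

q(x_t | x_{t−1}) = N( x_t ; √(1 − β_t) · x_{t−1}, β_t I )
x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε,   ε ~ N(0, I),   ᾱ_t = ∏_{s=1}^{t} (1 − β_s)

The denoising network ε_θ(x_t, t) is trained to predict the added noise, typically with the simple loss E[ ‖ε − ε_θ(x_t, t)‖² ].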

Popular Diffusion Models:

  • DDPM (Denoising Diffusion Probabilistic Models) – Introduced by Ho et al., widely used for image generation.
  • Stable Diffusion – Open-source model for generating high-quality images from text prompts.
  • DALL·E 2 – Uses diffusion models for text-to-image synthesis.
  • Imagen (Google Research) – High-resolution text-to-image generation.

Applications:

  • Art & Content Creation: Generates realistic and artistic images from text descriptions.
  • Super-Resolution: Enhances low-resolution images.
  • Data Augmentation: Creates synthetic data for training deep learning models.
  • Medical Imaging: Generates synthetic scans for training AI models in healthcare.
  • Inpainting & Restoration: Fills missing parts in images or reconstructs corrupted images.

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Reshape, Conv2DTranspose
from tensorflow.keras.models import Model
import numpy as np

# Define the diffusion model architecture
input_layer = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_layer)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dense(784, activation='sigmoid')(x)
output_layer = Reshape((28, 28, 1))(x)

diffusion_model = Model(input_layer, output_layer)

# Compile the model
diffusion_model.compile(optimizer='adam', loss='binary_crossentropy')

# Summary of the model
diffusion_model.summary()

from tensorflow.keras.datasets import mnist

# Load the dataset

(x_train, _), (x_test, _) = mnist.load_data()

x_train = x_train.astype('float32') / 255.

x_test = x_test.astype('float32') / 255.

x_train = np.expand_dims(x_train, axis=-1)

x_test = np.expand_dims(x_test, axis=-1)

# Add noise to the images

noise_factor = 0.5

x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)

x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)

x_train_noisy = np.clip(x_train_noisy, 0., 1.)

x_test_noisy = np.clip(x_test_noisy, 0., 1.)

# Train the model

diffusion_model.fit(x_train_noisy, x_train, epochs=50, batch_size=128,

                    shuffle=True, validation_data=(x_test_noisy, x_test))

import matplotlib.pyplot as plt

# Predict the denoised images

denoised_images = diffusion_model.predict(x_test_noisy)

# Visualize the results

n = 10  # Number of digits to display

plt.figure(figsize=(20, 6))

for i in range(n):

    # Display original
    ax = plt.subplot(3, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # Display noisy
    ax = plt.subplot(3, n, i + 1 + n)
    plt.imshow(x_test_noisy[i].reshape(28, 28), cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # Display denoised
    ax = plt.subplot(3, n, i + 1 + 2 * n)
    plt.imshow(denoised_images[i].reshape(28, 28), cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

plt.show()


# Unfreeze the top layers of the model

for layer in diffusion_model.layers[-4:]:

    layer.trainable = True

# Compile the model again

diffusion_model.compile(optimizer='adam', loss='binary_crossentropy')

# Train the model again

diffusion_model.fit(x_train_noisy, x_train, epochs=10, batch_size=128,

                    shuffle=True, validation_data=(x_test_noisy, x_test))

TensorFlow for Unsupervised Learning

Building a Clustering Model with TensorFlow

import tensorflow as tf

from tensorflow.keras.datasets import mnist

import numpy as np

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

# Load and preprocess the MNIST dataset

(x_train, _), (_, _) = mnist.load_data()

x_train = x_train.astype('float32') / 255.0

x_train = x_train.reshape(-1, 28 * 28)

# Apply K-means clustering

kmeans = KMeans(n_clusters=10)

kmeans.fit(x_train)

labels = kmeans.labels_


# Display a few clustered images

def display_cluster_images(x_train, labels):

    plt.figure(figsize=(10, 10))

    for i in range(10):

        idxs = np.where(labels == i)[0]

        for j in range(10):

            plt_idx = i * 10 + j + 1

            plt.subplot(10, 10, plt_idx)

            plt.imshow(x_train[idxs[j]].reshape(28, 28), cmap='gray')

            plt.axis('off')

    plt.show()

display_cluster_images(x_train, labels)

Building a Dimensionality Reduction Model with TensorFlow

from tensorflow.keras.layers import Input, Dense

from tensorflow.keras.models import Model

# Define the autoencoder model

input_layer = Input(shape=(784,))

encoded = Dense(64, activation='relu')(input_layer)

bottleneck = Dense(32, activation='relu')(encoded)

decoded = Dense(64, activation='relu')(bottleneck)

output_layer = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(input_layer, output_layer)

# Compile the model

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train the autoencoder

autoencoder.fit(x_train, x_train, epochs=50, batch_size=256, shuffle=True, validation_split=0.2)

Evaluating the Dimensionality Reduction Model

from sklearn.manifold import TSNE

# Get the compressed representations

encoder = Model(input_layer, bottleneck)

compressed_data = encoder.predict(x_train)

# Use t-SNE for 2D visualization

tsne = TSNE(n_components=2)

compressed_2d = tsne.fit_transform(compressed_data)

# Plot the compressed data

plt.scatter(compressed_2d[:, 0], compressed_2d[:, 1], c=labels, cmap='viridis')

plt.colorbar()

plt.show()

Custom Keras Training Loops

import tensorflow as tf

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

# Create a simple model

model = Sequential([

    Dense(64, activation='relu'),

    Dense(10)

])

# Custom training loop

optimizer = tf.keras.optimizers.Adam()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Load or create your training dataset here

# Example:

x_train = tf.random.uniform((100, 10))  # Replace with your actual training data

y_train = tf.random.uniform((100,), maxval=10, dtype=tf.int64)  # Replace with your actual training labels

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

for epoch in range(10):

    for x_batch, y_batch in train_dataset:

        with tf.GradientTape() as tape:

            logits = model(x_batch, training=True)

            loss = loss_fn(y_batch, logits)

        grads = tape.gradient(loss, model.trainable_weights)

        optimizer.apply_gradients(zip(grads, model.trainable_weights))

   

    print(f'Epoch {epoch + 1}, Loss: {loss.numpy()}')

Keras Specialized Layers

from tensorflow.keras.layers import Layer

import tensorflow as tf

class CustomDenseLayer(Layer):

    def __init__(self, units=32):

        super(CustomDenseLayer, self).__init__()

        self.units = units

    def build(self, input_shape):

        self.w = self.add_weight(shape=(input_shape[-1], self.units),

                                 initializer='random_normal',

                                 trainable=True)

        self.b = self.add_weight(shape=(self.units,),

                                 initializer='zeros',

                                 trainable=True)

    def call(self, inputs):

        return tf.matmul(inputs, self.w) + self.b

model = Sequential([CustomDenseLayer(64), Dense(10)])

Advanced callback functions in Keras

from tensorflow.keras.callbacks import Callback

class CustomCallback(Callback):

    def on_epoch_end(self, epoch, logs=None):

        logs = logs or {}

        print(f'End of epoch {epoch}, loss: {logs.get("loss")}, accuracy: {logs.get("accuracy")}')

# Usage in model training

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',

              metrics=['accuracy'])

model.fit(train_dataset, epochs=10, callbacks=[CustomCallback()])

Mixed precision training

from tensorflow.keras import mixed_precision

# Enable mixed precision

mixed_precision.set_global_policy('mixed_float16')

# Model definition

model = Sequential([Dense(64, activation='relu'), Dense(10)])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model

model.fit(train_dataset, epochs=10)
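
One caveat: the Keras mixed-precision guide recommends keeping the model's final outputs in float32 for numeric stability. A minimal sketch of the same model with the output layer's dtype pinned:

model = Sequential([
    Dense(64, activation='relu'),
    Dense(10, dtype='float32')  # keep logits in float32 even under mixed_float16
])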

Custom Training Loop - In depth

!pip install tensorflow==2.16.2 matplotlib==3.9.1

import warnings

warnings.filterwarnings("ignore", category=UserWarning)

import tensorflow as tf

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Flatten

# Prepare a simple dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x_train, x_test = x_train / 255.0, x_test / 255.0

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

optimizer = tf.keras.optimizers.Adam()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model = Sequential([

    Flatten(input_shape=(28, 28)),

    Dense(128, activation='relu'),

    Dense(10)

])

epochs = 5

for epoch in range(epochs):

    print(f'Start of epoch {epoch + 1}')

    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        with tf.GradientTape() as tape:

            logits = model(x_batch_train, training=True)

            loss_value = loss_fn(y_batch_train, logits)

        grads = tape.gradient(loss_value, model.trainable_weights)

        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        if step % 200 == 0:

            print(f'Epoch {epoch + 1} Step {step}: Loss = {loss_value.numpy()}')
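
As a performance refinement, the gradient step can be compiled into a TensorFlow graph with tf.function. This is a sketch (the train_step helper name is ours), reusing the model, optimizer, and loss_fn defined above:

@tf.function  # trace the step into a graph for faster repeated execution
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss_value = loss_fn(y_batch, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss_value

for epoch in range(epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)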

Hyperparameter tuning using Keras Tuner

Keras Tuner is a library that automates the process of hyperparameter tuning for deep learning models.


Key Features:

  • Automates hyperparameter tuning to find the best model configuration.
  • Supports multiple search algorithms for different tuning strategies.
  • Integrates seamlessly with TensorFlow and Keras.

Search Algorithms:

  1. Random Search
    • Randomly selects hyperparameters within a defined range.
    • Efficient for quick experimentation.
  2. Hyperband
    • Uses an adaptive resource allocation strategy to evaluate models efficiently.
    • Focuses on the most promising hyperparameter configurations early.
  3. Bayesian Optimization
    • Uses past trials to model the probability of hyperparameters leading to better results.
    • Balances exploration and exploitation for optimal tuning.

How Keras Tuner Works:

  1. Define a model-building function with tunable hyperparameters.
  2. Select a search algorithm (Random Search, Hyperband, Bayesian Optimization).
  3. Run the tuner to find the best hyperparameter values.
  4. Retrieve and use the best model configuration.

# Install required package

!pip install keras-tuner

# Import necessary libraries

import keras_tuner as kt

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Flatten

from tensorflow.keras.datasets import mnist

from tensorflow.keras.optimizers import Adam

# Load and preprocess the MNIST dataset

(x_train, y_train), (x_val, y_val) = mnist.load_data()

x_train, x_val = x_train / 255.0, x_val / 255.0  # Normalize data

# Define a model-building function with hyperparameters

def build_model(hp):

    model = Sequential([

        Flatten(input_shape=(28, 28)),

        Dense(units=hp.Int('units', min_value=32, max_value=512, step=32), activation='relu'),

        Dense(10, activation='softmax')

    ])

   

    model.compile(

        optimizer=Adam(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='LOG')),

        loss='sparse_categorical_crossentropy',

        metrics=['accuracy']

    )

   

    return model

# Configure the hyperparameter search

tuner = kt.RandomSearch(

    build_model,

    objective='val_accuracy',

    max_trials=10,

    executions_per_trial=2,

    directory='my_dir',

    project_name='intro_to_kt'

)

# Run the hyperparameter search

tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))

# Retrieve the best hyperparameters

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""

The optimal number of units in the first dense layer is {best_hps.get('units')}.

The optimal learning rate for the optimizer is {best_hps.get('learning_rate')}.

""")

# Build and train the optimized model

model = tuner.hypermodel.build(best_hps)

model.summary()

model.fit(x_train, y_train, epochs=10, validation_split=0.2)

# Evaluate the final model

test_loss, test_acc = model.evaluate(x_val, y_val)

print(f'Test accuracy: {test_acc}')
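
The same build_model function works with the other search strategies; for example, Hyperband can be swapped in with only the tuner construction changing (a sketch; the directory and project names are illustrative):

tuner_hb = kt.Hyperband(
    build_model,
    objective='val_accuracy',
    max_epochs=10,        # maximum epochs any single trial may run
    factor=3,             # downsampling factor between Hyperband brackets
    directory='my_dir',
    project_name='hyperband_kt'
)

tuner_hb.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))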

Model Optimization

Weight Initialization

Proper weight initialization is crucial in deep learning to avoid issues like vanishing or exploding gradients and ensure faster convergence during training.

Key Weight Initialization Methods:

1. Xavier (Glorot) Initialization

  • Designed for sigmoid and tanh activations.
  • Ensures variance remains constant across layers.

2. He Initialization

  • Best suited for ReLU and Leaky ReLU activations.
  • Helps mitigate the dying ReLU problem.

Why Use These Initializations?

  • Prevents vanishing gradients (important for deep networks).
  • Stabilizes training, leading to faster convergence.
  • Improves model accuracy by maintaining proper variance across layers.

Comparison Table:

Initialization    | Suitable Activation | Best Use Case
Xavier (Glorot)   | sigmoid, tanh       | Balanced activations
He Initialization | ReLU, LeakyReLU     | Deep ReLU networks

from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.layers import Dense

layer = Dense(128, activation='relu', kernel_initializer=HeNormal())
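
For comparison, the Xavier (Glorot) counterpart for a tanh layer uses the built-in GlorotUniform initializer:

from tensorflow.keras.initializers import GlorotUniform

layer = Dense(128, activation='tanh', kernel_initializer=GlorotUniform())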


Learning Rate Scheduling

Why Adjust the Learning Rate?

  • A static learning rate may not be optimal throughout training.
  • Adaptive learning rates help in:
    • Faster convergence in early epochs.
    • More stable updates in later epochs.

Using Learning Rate Scheduling in Keras

  1. Exponential Decay Schedule
    • Decreases learning rate after a certain number of epochs.

from tensorflow.keras.callbacks import LearningRateScheduler

import tensorflow as tf

def scheduler(epoch, lr):

    if epoch < 10:

        return lr

    else:

        return float(lr * tf.math.exp(-0.1))

lr_scheduler = LearningRateScheduler(scheduler)
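
Keras also provides decay schedules that attach directly to the optimizer instead of using a callback; a minimal sketch with ExponentialDecay (the specific decay numbers are illustrative):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

lr_schedule = ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,  # decay applied every 1000 optimizer steps
    decay_rate=0.9
)

optimizer = Adam(learning_rate=lr_schedule)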

Model Training

Dataset Preparation (MNIST Example)

from tensorflow.keras.datasets import mnist

from tensorflow.keras.utils import to_categorical

# Load dataset

(x_train, y_train), (x_val, y_val) = mnist.load_data()

# Normalize inputs

x_train, x_val = x_train.astype('float32') / 255.0, x_val.astype('float32') / 255.0

# Reshape for neural network input
x_train = x_train.reshape(-1, 28, 28)
x_val = x_val.reshape(-1, 28, 28)

# One-hot encode the labels to match the categorical_crossentropy loss
y_train, y_val = to_categorical(y_train, 10), to_categorical(y_val, 10)

from tensorflow.keras.optimizers import Adam

# Compile the model (assumes a Keras `model` for 28x28 inputs is already defined)
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

#Training the Model with a Learning Rate Scheduler

history = model.fit(x_train, y_train,

                    validation_data=(x_val, y_val),

                    epochs=20,

                    callbacks=[lr_scheduler])

Evaluating the Model on Test Data

from tensorflow.keras.optimizers import Adam

# Compile the model

model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=20)

# Evaluate on the held-out split
test_loss, test_acc = model.evaluate(x_val, y_val)
print(f'Test accuracy: {test_acc}')


Deep Learning Model Optimization Techniques

1. Batch Normalization

What is Batch Normalization?

  • Normalizes the input layer by adjusting and scaling activations.
  • Helps in faster convergence and reduces internal covariate shift.

Implementation in Keras

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Flatten, BatchNormalization

# Define a simple model with Batch Normalization

model = Sequential([

    Flatten(input_shape=(28, 28)),

    Dense(128, activation='relu'),

    BatchNormalization(),  # Batch Normalization Layer

    Dense(10, activation='softmax')

])

2. Mixed Precision Training

What is Mixed Precision Training?

  • Uses both 16-bit and 32-bit floating-point types to speed up training on modern GPUs.
  • Reduces memory usage while maintaining model accuracy.

Implementation in Keras

from tensorflow.keras import mixed_precision

# Enable mixed precision

policy = mixed_precision.Policy('mixed_float16')

mixed_precision.set_global_policy(policy)

3. Model Pruning

What is Model Pruning?

  • Reduces the number of parameters in a model by removing less significant connections or neurons.
  • Helps compress models for deployment in resource-constrained environments.

Implementation in Keras

import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Apply pruning to the model

pruning_params = {

    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(

        initial_sparsity=0.0,

        final_sparsity=0.5,

        begin_step=0,

        end_step=2000

    )

}

model_pruned = prune_low_magnitude(model, **pruning_params)
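
Two practical details from tensorflow_model_optimization: the pruning schedule only advances when the UpdatePruningStep callback runs during training, and strip_pruning removes the pruning wrappers before export. A sketch, assuming MNIST-style x_train/y_train as in the earlier sections:

model_pruned.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model_pruned.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove pruning wrappers before saving or converting the model
final_model = tfmot.sparsity.keras.strip_pruning(model_pruned)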

4. Quantization

  • Reduces the precision of the numbers used to represent the model’s weights.
  • Helps in reducing model size and improving inference speed, especially for edge devices and mobile applications.

Implementation in Keras

import tensorflow as tf

# Convert the trained model to TensorFlow Lite format with quantization

converter = tf.lite.TFLiteConverter.from_keras_model(model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]

quantized_model = converter.convert()

# Save the quantized model

with open('quantized_model.tflite', 'wb') as f:

    f.write(quantized_model)
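
To sanity-check the quantized model, it can be loaded back with the TFLite Interpreter and run on a dummy input:

import numpy as np

interpreter = tf.lite.Interpreter(model_content=quantized_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference on a zero-filled input of the expected shape and dtype
sample = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]['index'])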

TensorFlow for Model Optimization

Mixed precision training:


# Install TensorFlow (ensure you have the correct version)

!pip install tensorflow==2.16.2

# Import necessary libraries

import tensorflow as tf

from tensorflow.keras import layers, models, optimizers

from tensorflow.keras import mixed_precision

# Enable mixed precision training

mixed_precision.set_global_policy('mixed_float16')

# Load and preprocess the MNIST dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize the data to range [0,1]

x_train, x_test = x_train.astype('float32') / 255.0, x_test.astype('float32') / 255.0

# Define the model

model = models.Sequential([

    layers.Input(shape=(28, 28)),

    layers.Flatten(),

    layers.Dense(128, activation='relu'),

    layers.Dense(10, activation='softmax', dtype='float32')  # keep outputs in float32 under mixed precision

])

# Compile the model with optimizer and loss function

optimizer = optimizers.Adam(learning_rate=1e-3)

model.compile(optimizer=optimizer,

              loss='sparse_categorical_crossentropy',

              metrics=['accuracy'])

# Train the model

model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

# Evaluate the model

test_loss, test_acc = model.evaluate(x_test, y_test)

print(f'Test Accuracy: {test_acc}')

Knowledge Distillation

# Install TensorFlow if not installed

!pip install tensorflow

# Import necessary libraries

import tensorflow as tf

from tensorflow.keras import layers, models, optimizers

import numpy as np

# Load and preprocess the MNIST dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize the dataset

x_train, x_test = x_train.astype('float32') / 255.0, x_test.astype('float32') / 255.0

# Use only a subset of the dataset for quick training

x_train, y_train = x_train[:1000], y_train[:1000]

x_test, y_test = x_test[:1000], y_test[:1000]

# Define the Teacher Model (Larger, more complex model)

teacher_model = models.Sequential([

    layers.Input(shape=(28, 28)),

    layers.Flatten(),

    layers.Dense(128, activation='relu'),

    layers.Dense(10)  # output raw logits so they can be temperature-scaled in the distillation loss

])

# Compile the Teacher Model

teacher_model.compile(optimizer=optimizers.Adam(),
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])

# Assume the teacher model is already trained

# teacher_model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

# Define the Student Model (Smaller, simpler model)

student_model = models.Sequential([

    layers.Input(shape=(28, 28)),

    layers.Flatten(),

    layers.Dense(32, activation='relu'),

    layers.Dense(10)  # raw logits, matching the teacher

])

# Compile the Student Model

student_model.compile(optimizer=optimizers.Adam(),
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])

# Define the distillation loss function using TensorFlow operations

def distillation_loss(teacher_logits, student_logits, temperature=3):

    teacher_probs = tf.nn.softmax(teacher_logits / temperature)

    student_probs = tf.nn.softmax(student_logits / temperature)

    return tf.reduce_mean(tf.keras.losses.categorical_crossentropy(teacher_probs, student_probs))

# Define the function to train the student model using knowledge distillation

def train_student(student, teacher, x_train, y_train, batch_size=32, epochs=2, temperature=3):

    for epoch in range(epochs):

        num_batches = len(x_train) // batch_size

        for batch in range(num_batches):

            x_batch = x_train[batch * batch_size: (batch + 1) * batch_size]

            y_batch = y_train[batch * batch_size: (batch + 1) * batch_size]

            # Predict teacher logits for the batch

            teacher_logits = teacher.predict(x_batch)

            with tf.GradientTape() as tape:

                # Predict student logits for the batch

                student_logits = student(x_batch)

                # Compute distillation loss

                loss = distillation_loss(teacher_logits, student_logits, temperature)

            # Compute gradients and apply updates

            grads = tape.gradient(loss, student.trainable_variables)

            student.optimizer.apply_gradients(zip(grads, student.trainable_variables))

        print(f"Epoch {epoch + 1} completed. Loss: {loss.numpy()}")

# Train the student model using knowledge distillation

train_student(student_model, teacher_model, x_train, y_train, batch_size=32, epochs=2)


This example:

  • Loads and preprocesses the MNIST dataset
  • Defines a larger Teacher Model
  • Defines a smaller Student Model
  • Implements a custom Knowledge Distillation loss function
  • Trains the Student Model using the Teacher Model’s logits
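
For reference, the commonly used combined objective from Hinton et al. (2015) mixes this soft distillation term with the ordinary hard-label loss (α and the temperature T are tunable; the T² factor keeps gradient magnitudes comparable across temperatures):

L = α · CE(y_true, student) + (1 − α) · T² · CE( softmax(teacher / T), softmax(student / T) )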


Q-Learning with Keras

Q-Learning is a value-based reinforcement learning algorithm that enables an agent to learn an optimal policy through interaction with its environment. The objective is to maximize cumulative reward over time by updating a Q-value function, which estimates the expected reward for taking a given action in a particular state.

It maintains and updates a Q-table using the Bellman equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_{a′} Q(s′,a′) − Q(s,a) ]


where:

  • Q(s,a) is the current Q-value for state s and action a,
  • α (learning rate) controls how much new information overrides the old,
  • r is the immediate reward,
  • s′ (next state) results from taking action a,
  • a′ (next action) is chosen in s′,
  • γ (discount factor) determines the importance of future rewards,
  • max_{a′} Q(s′,a′) is the highest estimated Q-value for the next state.

Through repeated interactions, Q-learning converges to an optimal policy, enabling the agent to select the best actions to maximize long-term rewards.


Steps to implement Q-learning with Keras

1. Initialize the Environment and Parameters

  • Use a platform like OpenAI Gym (CartPole) to define the environment.
  • Initialize the Q-network (neural network) instead of a traditional Q-table.
  • Set key hyperparameters:
    • Learning rate (α): Determines how much new information overrides old values.
    • Discount factor (γ): Balances immediate vs. future rewards.
    • Exploration rate (ε): Controls trade-off between exploration (random actions) and exploitation (choosing the best action).

import gym

import numpy as np

# Initialize the environment

env = gym.make('CartPole-v1')

# Set hyperparameters

alpha = 0.001  # Learning rate

gamma = 0.99  # Discount factor

epsilon = 1.0  # Exploration rate

epsilon_min = 0.01

epsilon_decay = 0.995

episodes = 100

# Get state and action sizes (a Q-network replaces the traditional Q-table here,
# since CartPole's state space is continuous)

state_size = env.observation_space.shape[0]

action_size = env.action_space.n


2. Build the Q-Network with Keras

  • Create a deep neural network to approximate Q-values.
  • Input: State representation.
  • Output: Q-values for all possible actions.
  • Use Dense layers with ReLU activation, and an output layer with linear activation.
    • Input layer size = state size
    • Output layer size = action size
    • 2 to 3 hidden layers with ReLU activation

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.optimizers import Adam

def build_q_network(state_size, action_size):

    model = Sequential()

    model.add(Dense(24, input_dim=state_size, activation='relu'))

    model.add(Dense(24, activation='relu'))

    model.add(Dense(action_size, activation='linear'))

    model.compile(loss='mse', optimizer=Adam(learning_rate=alpha))

    return model

# Build the Q-network

q_network = build_q_network(state_size, action_size)


3. Train the Q-Network

  • Get/Initialize the state
  • Select an action:
    • With probability ε, select a random action (exploration)
    • Otherwise, select the action with the highest predicted Q-value (exploitation)
  • Take the action:
    • Execute the chosen action in the environment
  • Update Q-values:
    • Use the Bellman equation to compute the target Q-value
    • Train the Q-network to minimize the difference between the predicted and target Q-values
  • Repeat:
    • Reduce the exploration rate (ε) to shift from exploration to exploitation
  • Implement experience replay:
    • Store agent experiences (s,a,r,s′) in a replay memory.
    • Train the model by sampling mini-batches from memory.
  • Use target network:
    • Maintain a separate Q-network for stable training.
    • Update the target network periodically (less frequently than the primary Q-network).
  • Use the Mean Squared Error (MSE) loss function with Adam optimizer.

episodes = 100

for episode in range(episodes):

    state, info = env.reset()
    state = np.reshape(state, [1, state_size])
    total_reward = 0

    for time in range(500):

        # ε-greedy action selection
        if np.random.rand() <= epsilon:
            action = np.random.choice(action_size)
        else:
            q_values = q_network.predict(state, verbose=0)
            action = np.argmax(q_values[0])

        next_state, reward, done, trunc, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        total_reward += reward
        if done:
            reward = -10  # penalize episode termination

        # Compute the target Q-value via the Bellman equation and fit one step
        q_target = q_network.predict(state, verbose=0)
        if done:
            q_target[0][action] = reward
        else:
            q_target[0][action] = reward + gamma * np.amax(q_network.predict(next_state, verbose=0)[0])
        q_network.fit(state, q_target, epochs=1, verbose=0)

        state = next_state
        if done:
            print(f"Episode: {episode+1}/{episodes}, Score: {total_reward}")
            break

    # Decay the exploration rate once per episode
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay


4. Implement the Q-Learning Algorithm

  1. Initialize the environment.
  2. For each episode:
    • Reset the environment.
    • For each step:
      • Choose an action using ε-greedy policy.
      • Execute the action and observe the reward and next state.
      • Store the experience in memory.
      • Sample a mini-batch from memory.
      • Compute the target Q-value using the Bellman equation: Q_target(s,a) = r + γ max_{a′} Q(s′,a′)
      • Update the Q-network.
      • Reduce ε (exploration rate) over time.
  3. Periodically update the target network.

5. Evaluate and Optimize

  • Run test episodes to measure performance.
  • Tune hyperparameters for better convergence.

for episode in range(10):

    state, info = env.reset()

    state = np.reshape(state, [1, state_size])

    total_reward = 0

    for time in range(500):

        env.render()

        q_values = q_network.predict(state)

        action = np.argmax(q_values[0])

        next_state, reward, done, trunc, _ = env.step(action)

        next_state = np.reshape(next_state, [1, state_size])

        total_reward += reward

        state = next_state

        if done:

            print(f"Episode: {episode+1}, Score: {total_reward}")

            break

env.close()


Deep Q-Network with Keras

Deep Q-Networks (DQNs) are an extension of Q-Learning that utilize deep neural networks to approximate the Q-value function, making them suitable for environments with large state spaces where maintaining a traditional Q-table is impractical.

Key Features of DQNs

  1. Use Deep Neural Networks (DNNs) to Approximate Q-Values
    • Instead of storing a Q-table, DQNs use a deep neural network to estimate Q-values for each state-action pair.
    • The input to the network is the state, and the output is the Q-value for each possible action.
  2. Experience Replay
    • Stores agent experiences (state, action, reward, next_state, done) in a memory buffer.
    • During training, experiences are randomly sampled in mini-batches to break correlation between consecutive updates and improve stability.
  3. Target Network
    • Uses a separate network (target network) to generate target Q-values.
    • The target network is updated less frequently than the main Q-network, reducing instability caused by rapid Q-value fluctuations.

How DQN Works (Overview of the Algorithm)

  1. Initialize the environment and the deep Q-network.
  2. For each episode:
    • Observe the initial state.
    • Select an action using an ε-greedy policy (exploration vs. exploitation).
    • Take the action, observe reward and next state.
    • Store experience in the replay memory.
    • Sample a mini-batch from memory.
    • Compute the target Q-value using the Bellman equation: Q_target(s,a) = r + γ max_{a′} Q(s′,a′), with the max evaluated by the target network
    • Train the Q-network to minimize the difference between the predicted and target Q-values.
    • Periodically update the target network.
    • Reduce exploration rate (ε) over time.
  3. Repeat until convergence to an optimal policy.

Why Use DQNs?

Handles large state spaces where Q-tables are infeasible.

Stabilizes learning using experience replay and target networks.

Scalable to complex tasks like Atari games and robotics.


Steps to implement DQNs with Keras

1. Initialize the environment and parameters

  • Define an environment like OpenAI’s Gym
  • Set the hyperparameters for training
  • Initialize the replay buffer

import gym

import numpy as np

from collections import deque

# Initialize the environment

env = gym.make('CartPole-v1')

# Set hyperparameters

alpha = 0.001  # Learning rate

gamma = 0.99  # Discount factor

epsilon = 1.0  # Exploration rate

epsilon_min = 0.01

epsilon_decay = 0.995

episodes = 1000

batch_size = 64

memory_size = 2000

# Initialize replay buffer

memory = deque(maxlen=memory_size)

# Get state and action sizes

state_size = env.observation_space.shape[0]

action_size = env.action_space.n


2. Build the Q-Network and target network

  • Build primary Q-Network and target network
  • Target network provides stable Q-value targets during training

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.optimizers import Adam

# Function to build the Q-Network

def build_q_network(state_size, action_size):

    model = Sequential()

    model.add(Dense(24, input_dim=state_size, activation='relu'))

    model.add(Dense(24, activation='relu'))

    model.add(Dense(action_size, activation='linear'))

   

    model.compile(loss='mse', optimizer=Adam(learning_rate=alpha))

    return model

# Build the Q-Network and Target Network

q_network = build_q_network(state_size, action_size)

target_network = build_q_network(state_size, action_size)

target_network.set_weights(q_network.get_weights())  # Synchronize target network weights


3. Implement experience replay

  • Store agent’s experiences in a replay buffer
  • Sample random minibatches from the buffer to update the Q-Network

# Function to store experiences in the replay buffer

def remember(state, action, reward, next_state, done):

    memory.append((state, action, reward, next_state, done))

# Function to train the Q-network using experience replay

def replay(batch_size):

    minibatch = np.random.choice(len(memory), batch_size, replace=False)

    for index in minibatch:

        state, action, reward, next_state, done = memory[index]

        target = q_network.predict(state)

       

        if done:

            target[0][action] = reward  # No future reward if done

        else:

            t = target_network.predict(next_state)

            target[0][action] = reward + gamma * np.amax(t)  # Bellman Equation

       

        q_network.fit(state, target, epochs=1, verbose=0)  # Train the Q-network
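
Calling predict once per stored transition is slow. A vectorized variant (replay_vectorized is our name, not part of the original) samples the whole minibatch and uses two batched forward passes instead:

def replay_vectorized(batch_size):
    # Sample a minibatch of stored transitions
    idx = np.random.choice(len(memory), batch_size, replace=False)
    batch = [memory[i] for i in idx]
    states = np.vstack([b[0] for b in batch])
    actions = np.array([b[1] for b in batch])
    rewards = np.array([b[2] for b in batch], dtype=np.float32)
    next_states = np.vstack([b[3] for b in batch])
    dones = np.array([b[4] for b in batch], dtype=np.float32)

    # One batched prediction for current and next states
    targets = q_network.predict(states, verbose=0)
    next_q = target_network.predict(next_states, verbose=0)

    # Bellman update; terminal transitions keep only the immediate reward
    targets[np.arange(batch_size), actions] = rewards + gamma * np.amax(next_q, axis=1) * (1.0 - dones)

    q_network.fit(states, targets, epochs=1, verbose=0)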


4. Train the Q-Network

  • Iteratively update the Q-values using the Bellman equation.
  • Update using the gradients computed from the loss between the predicted Q-values and the target Q-values.

import gym

import numpy as np

from collections import deque

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.optimizers import Adam

# Initialize the environment

env = gym.make('CartPole-v1')

# Set hyperparameters

alpha = 0.001  # Learning rate

gamma = 0.99   # Discount factor

epsilon = 1.0  # Exploration rate

epsilon_min = 0.01

epsilon_decay = 0.995

episodes = 100

batch_size = 64

memory_size = 2000

# Initialize replay buffer

memory = deque(maxlen=memory_size)

# Get state and action sizes

state_size = env.observation_space.shape[0]

action_size = env.action_space.n

# Function to build the Q-Network

def build_q_network(state_size, action_size):

    model = Sequential()

    model.add(Dense(24, input_dim=state_size, activation='relu'))

    model.add(Dense(24, activation='relu'))

    model.add(Dense(action_size, activation='linear'))

   

    model.compile(loss='mse', optimizer=Adam(learning_rate=alpha))

    return model

# Build the Q-Network and Target Network

q_network = build_q_network(state_size, action_size)

target_network = build_q_network(state_size, action_size)

target_network.set_weights(q_network.get_weights())  # Synchronize target network weights

# Function to store experiences in the replay buffer

def remember(state, action, reward, next_state, done):

    memory.append((state, action, reward, next_state, done))

# Function to train the Q-network using experience replay

def replay(batch_size):

    minibatch = np.random.choice(len(memory), batch_size, replace=False)

    for index in minibatch:

        state, action, reward, next_state, done = memory[index]

        target = q_network.predict(state)

       

        if done:

            target[0][action] = reward  # No future reward if done

        else:

            t = target_network.predict(next_state)

            target[0][action] = reward + gamma * np.amax(t)  # Bellman Equation

       

        q_network.fit(state, target, epochs=1, verbose=0)  # Train the Q-network

# Main training loop

for episode in range(episodes):

    state, info = env.reset()

    state = np.reshape(state, [1, state_size])

    total_reward = 0

    for time in range(500):

        if np.random.rand() <= epsilon:

            action = np.random.choice(action_size)  # Exploration

        else:

            q_values = q_network.predict(state, verbose=0)

            action = np.argmax(q_values[0])  # Exploitation

        next_state, reward, done, truncated, info = env.step(action)

        next_state = np.reshape(next_state, [1, state_size])

        total_reward += reward

        if done:

            reward = -10  # Negative reward for losing

        # Store every transition, not just terminal ones, so replay has data
        remember(state, action, reward, next_state, done)

        if done:

            break

        state = next_state

        if len(memory) > batch_size:

            replay(batch_size)

    # Reduce exploration rate over time

    if epsilon > epsilon_min:

        epsilon *= epsilon_decay

    # Update the target network weights every 10 episodes

    if episode % 10 == 0:

        target_network.set_weights(q_network.get_weights())

    print(f"Episode: {episode+1}/{episodes}, Score: {total_reward}")


5. Evaluate the agent

  • Let the agent interact with the environment using the learned policy.
  • Measure performance by the cumulative reward obtained over multiple episodes.

# Evaluate the trained agent

for episode in range(10):

    state, info = env.reset()

    state = np.reshape(state, [1, state_size])

    total_reward = 0

    for time in range(500):

        env.render()  # Render environment to visualize the agent

        q_values = q_network.predict(state, verbose=0)

        action = np.argmax(q_values[0])  # Select best action (greedy policy)

        next_state, reward, done, trunc, _ = env.step(action)

        next_state = np.reshape(next_state, [1, state_size])

        total_reward += reward

        state = next_state

        if done:

            print(f"Episode: {episode+1}, Score: {total_reward}")

            break

env.close()