A neural network is not magic. It is a sequence of matrix multiplications followed by nonlinear functions, repeated layer by layer. In this lesson you will build one from scratch using nothing but NumPy, and you will see every number that flows through it. By the end you will have trained a network to solve problems that linear models cannot touch.
A Single Neuron
Every neural network starts with this building block: a neuron takes inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function.
A Single Neuron
──────────────────────────────────────────
Inputs       Weights     Sum + Bias                    Activation
──────       ───────     ──────────                    ──────────
x1 ─── w1 ──┐
            ├──► z = w1*x1 + w2*x2 + b ──► a = f(z)
x2 ─── w2 ──┘

f(z) is a nonlinear function (sigmoid, ReLU, etc.)
Without f(z), stacking layers would collapse to a single linear transform.
Why Nonlinearity Matters
If every neuron were just a linear function (output = weights * inputs + bias), then stacking 100 layers would still produce a linear function. You could replace the entire network with a single layer. Nonlinear activation functions break this limitation and allow networks to learn curved decision boundaries, thresholds, and complex patterns.
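You can verify the collapse claim directly in NumPy; this is a quick check, not part of the network we build later (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between...
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))
x = rng.normal(size=3)

two_layers = (x @ W1) @ W2

# ...are exactly one layer whose weight matrix is W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, matrix multiplication is associative, so the whole stack is one matrix.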
import numpy as np

np.random.seed(42)

# A single neuron
def single_neuron(x, w, b):
    """Weighted sum + bias, then sigmoid activation."""
    z = np.dot(x, w) + b
    a = 1 / (1 + np.exp(-z))  # sigmoid
    return a

# Two inputs
x = np.array([0.5, 0.8])
w = np.array([0.3, -0.1])
b = 0.2

output = single_neuron(x, w, b)
print(f"Inputs: {x}")
print(f"Weights: {w}")
print(f"Bias: {b}")
print(f"Weighted sum: {np.dot(x, w) + b:.4f}")
print(f"After sigmoid: {output:.4f}")
Expected output:
Inputs: [0.5 0.8]
Weights: [0.3 -0.1]
Bias: 0.2
Weighted sum: 0.2700
After sigmoid: 0.5671
The sigmoid squished the value 0.27 to 0.5671 (always between 0 and 1). That is the entire computation inside one neuron.
Activation Functions
Two activation functions cover most practical cases.
import numpy as np

np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    """Derivative of sigmoid, given the output a = sigmoid(z)."""
    return a * (1 - a)
Sigmoid squishes any value into (0, 1). Useful for output layers where you want a probability. ReLU (Rectified Linear Unit) passes positive values through and zeros out negatives. It is simple, fast, and avoids the vanishing gradient problem that plagues deep sigmoid networks.
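ReLU and its derivative are one line each; here is a sketch, plus a back-of-the-envelope look at why sigmoid gradients vanish (the helper names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

# Why deep sigmoid networks struggle: sigmoid'(z) peaks at 0.25,
# so each backprop step through a sigmoid shrinks the gradient
# by at least a factor of 4.
print(f"Gradient scale after 10 sigmoid layers: {0.25 ** 10:.2e}")
print(f"ReLU gradient for a positive input:     {relu_derivative(np.array([2.0]))[0]}")
```

Ten sigmoid layers already scale the gradient down by roughly a million; ReLU passes the gradient through unchanged wherever the neuron is active.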
The XOR Problem
XOR is the classic test: the output is 1 when exactly one input is 1, and 0 otherwise. A single layer (linear model) cannot solve this because the classes are not linearly separable. A two-layer network can.
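A minimal two-layer network that learns XOR might look like the following; the layer sizes, learning rate, and epoch count here are my choices, not the only ones that work:

```python
import numpy as np

np.random.seed(42)

# XOR: output is 1 when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 2 inputs -> 8 hidden (ReLU) -> 1 output (sigmoid)
W1 = np.random.randn(2, 8) * 0.5
b1 = np.zeros(8)
W2 = np.random.randn(8, 1) * 0.5
b2 = np.zeros(1)

lr = 0.5
for epoch in range(10000):
    # Forward pass
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)          # ReLU
    z2 = h @ W2 + b2
    pred = 1 / (1 + np.exp(-z2))   # sigmoid

    # Backward pass: MSE -> sigmoid -> hidden layer (chain rule)
    grad_z2 = 2 * (pred - y) / len(X) * pred * (1 - pred)
    grad_W2 = h.T @ grad_z2
    grad_b2 = grad_z2.sum(axis=0)
    grad_z1 = (grad_z2 @ W2.T) * (z1 > 0)   # ReLU derivative
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # Gradient descent update
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print(np.round(pred.ravel(), 3))
```

With this setup the predictions should approach [0, 1, 1, 0]; the backward pass is unpacked step by step in the next section.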
The network learned XOR. A linear model can never get all four points right on this problem: the best any single line can do is misclassify one of them.
Understanding Backpropagation
Backpropagation is the chain rule applied backwards through the network. Each layer computes: “How much did my weights contribute to the total error?”
1. Forward pass: compute predictions layer by layer.
2. Compute the loss: how far off are the predictions? (MSE in our case.)
3. Output layer gradient: compute how the loss changes with respect to the output weights. This uses the derivative of the sigmoid and the derivative of MSE.
4. Hidden layer gradient: propagate the error backward through the output weights and multiply by the derivative of ReLU. This tells each hidden neuron how much it contributed to the error.
5. Update the weights: subtract learning_rate * gradient from each weight. Repeat.
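A standard way to trust a hand-written backward pass is to compare it against finite differences. Here is a minimal gradient check for a single sigmoid neuron with MSE loss (the setup and names are mine):

```python
import numpy as np

np.random.seed(0)

x = np.random.randn(3)
y = np.array([1.0])
W = np.random.randn(3, 1)

def loss(W):
    pred = 1 / (1 + np.exp(-(x @ W)))
    return np.mean((pred - y) ** 2)

# Analytic gradient via the chain rule: MSE -> sigmoid -> weights
pred = 1 / (1 + np.exp(-(x @ W)))
grad_analytic = np.outer(x, 2 * (pred - y) * pred * (1 - pred))

# Numerical gradient via central finite differences
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus = W.copy();  W_plus[i, j] += eps
        W_minus = W.copy(); W_minus[i, j] -= eps
        grad_numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```

If the two gradients disagree, the bug is almost always a missing derivative factor somewhere in the chain.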
Now a more practical problem. Given temperature and vibration readings from an industrial sensor, classify each reading as normal (0), warning (1), or critical (2).
Predicted: Normal (99.01%) | Actual: Normal | correct
...
The network architecture uses softmax in the output layer for multi-class classification. Softmax converts raw scores into probabilities that sum to 1 across all classes.
3-Class Sensor Classification Network
──────────────────────────────────────────
Input (2)         Hidden (16)        Output (3)
─────────         ───────────        ──────────
Temperature ──┬── h1                ┌─ Normal   (softmax)
              ├── h2                │
              ├── h3  ──────────────├─ Warning  (softmax)
              ├── ...               │
              └── h16               └─ Critical (softmax)
Vibration ────┘
                  ReLU                Probabilities sum to 1.0
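A forward pass through this 2 → 16 → 3 architecture can be sketched as follows. The weights here are random and untrained, so the probabilities are meaningless; the point is the shape of the computation and that softmax output always sums to 1:

```python
import numpy as np

np.random.seed(42)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One sensor reading: (temperature, vibration), already scaled
x = np.array([0.7, 0.2])

# Untrained 2 -> 16 -> 3 network
W1 = np.random.randn(2, 16) * 0.5
b1 = np.zeros(16)
W2 = np.random.randn(16, 3) * 0.5
b2 = np.zeros(3)

h = np.maximum(0, x @ W1 + b1)   # ReLU hidden layer
probs = softmax(h @ W2 + b2)     # Normal / Warning / Critical probabilities

print(np.round(probs, 3), f"sum = {probs.sum():.1f}")
```

The max-subtraction inside softmax does not change the result (it cancels in the ratio) but prevents overflow when the raw scores are large.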
Key Takeaways
A Neuron
Weighted sum + bias + activation function. That is the entire computation. Everything else is just organizing neurons into layers.
Forward Pass
Input flows through each layer: multiply by weights, add bias, apply activation. The output of one layer is the input to the next.
Backpropagation
The chain rule applied backwards. Each layer computes its gradient, which tells us how to adjust its weights to reduce the loss.
What Frameworks Do
TensorFlow and PyTorch do exactly what you just coded, but optimized for GPUs, with automatic differentiation, and scaled to millions of parameters. Now you know what happens inside.
At their core, the big frameworks do exactly what you just coded: matrix multiplications, nonlinear activations, loss computation, and gradient updates, optimized and scaled up to billions of parameters.