Mastering AI
- Mathematics for Machine Learning
a. Linear Algebra
1. Scalars, Vectors, Matrices, and Tensors
Scalars: A scalar is a single number. In most contexts, we are talking about real numbers. So, for example, 5 is a scalar.
Vectors: A vector is an ordered array of numbers. These numbers can represent anything, but in the context of machine learning, they often represent feature values for a data point. For example, a vector could be [4, 2, 9] where each number is a different feature value.
Matrices: A matrix is a 2D array of numbers. So, for example, a matrix might look like this:
1 2 3
4 5 6
7 8 9
Each row could represent a different data point, and each column represents a different feature.
Tensors: A tensor is a generalization of scalars, vectors, and matrices to higher dimensions. A scalar is a 0D tensor, a vector is a 1D tensor, a matrix is a 2D tensor, and if you have an array with three indices, that's a 3D tensor.
These concepts are fundamental to the understanding of data in machine learning, as datasets are often represented as matrices or tensors. Additionally, machine learning models often perform operations on these data structures, such as dot product, matrix multiplication, etc. which are used to learn from data and make predictions.
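As a concrete illustration, here is a minimal sketch of how these objects are typically represented as arrays. It assumes NumPy is available; the particular values are arbitrary examples.

import numpy as np

scalar = 5.0                                   # 0D tensor: a single number
vector = np.array([4, 2, 9])                   # 1D tensor: feature values for one data point
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])                 # 2D tensor: rows = data points, columns = features
tensor3d = np.zeros((2, 3, 4))                 # 3D tensor: e.g., a small stack of 3x4 matrices

print(vector.ndim, matrix.ndim, tensor3d.ndim)   # 1 2 3
print(matrix.shape)                              # (3, 3)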
2. Basic Operations
Understanding these will be vital in implementing and understanding machine learning algorithms.
Vector Addition and Subtraction: This is done element-wise. If you have two vectors a = [a1, a2] and b = [b1, b2], their addition would be [a1+b1, a2+b2] and subtraction would be [a1-b1, a2-b2].
Scalar Multiplication and Division: Each element in the vector or matrix is multiplied or divided by the scalar. If the scalar is c and vector is a = [a1, a2], scalar multiplication would result in [ca1, ca2].
Dot Product: This is an operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number. The dot product of [a1, a2] and [b1, b2] is (a1b1 + a2b2).
Cross Product: This operation, defined for three-dimensional vectors, takes in two vectors and produces a vector as output. The cross product of the vectors a and b points in a direction perpendicular to both a and b, with a magnitude equal to the area of the parallelogram that a and b span.
Matrix Addition and Subtraction: Just like vectors, this is done element-wise.
Matrix Multiplication: It's not done element-wise. Instead, it involves a series of dot product calculations between the rows of the first matrix and columns of the second. If you're multiplying a matrix A of size (m x n) with a matrix B of size (n x p), the resulting matrix will be of size (m x p).
Matrix Transpose: The transpose of a matrix is achieved by flipping the matrix over its diagonal, switching the row and column indices of each element.
Matrix Inversion: The process of finding a matrix that, when multiplied with the original matrix, results in an identity matrix. Not all matrices are invertible. Invertible matrices are also known as nonsingular or nondegenerate.
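The sketch below is a minimal NumPy example of these operations; the vectors and matrices used are purely illustrative.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)              # element-wise vector addition -> [5. 7. 9.]
print(a - b)              # element-wise vector subtraction
print(2.0 * a)            # scalar multiplication -> [2. 4. 6.]
print(np.dot(a, b))       # dot product: 1*4 + 2*5 + 3*6 = 32.0
print(np.cross(a, b))     # cross product (3D vectors) -> [-3.  6. -3.]

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A + B)              # element-wise matrix addition
print(A @ B)              # matrix multiplication: rows of A dotted with columns of B
print(A.T)                # transpose: rows and columns swapped
print(np.linalg.inv(A))   # inverse: A @ inv(A) is (numerically) the identity matrix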
Remember, these operations form the backbone of more complex operations and manipulations in machine learning, especially in optimization algorithms and when calculating predictions in models. It's crucial to understand them well.
3. Matrix Types and Operations
Matrix Types:
Diagonal Matrix: A matrix in which the entries outside the main diagonal are all zero. The main diagonal is from top left to bottom right of the matrix.
Identity Matrix: A special type of diagonal matrix where the values on the main diagonal are all ones. It's the equivalent of the number 1 in matrix operations.
Symmetric Matrix: A matrix that is equal to its transpose. This means the elements on the top-left to bottom-right diagonal mirror those on the bottom-left to top-right diagonal.
Orthogonal Matrix: A square matrix whose rows and columns are orthonormal vectors, meaning the dot product of any two distinct rows (or any two distinct columns) is zero and each row and column has length 1. Equivalently, its transpose equals its inverse.
Matrix Operations:
Transpose: The transpose of a matrix is obtained by flipping it over its diagonal, swapping the row and column indices of each element.
Inverse: The inverse of a matrix A is another matrix, denoted as A^-1, such that when you multiply A and A^-1, you get the identity matrix. It's important to note that not all matrices have an inverse. Matrices that do not have an inverse are called singular or degenerate.
Rank: The rank of a matrix is the maximum number of linearly independent row vectors in the matrix. It gives us a measure of how "full" or "information-rich" a matrix is.
Determinant: The determinant of a matrix is a special number that can be calculated from a square matrix. It provides important information about the matrix and is often used to solve systems of equations.
Eigenvalues and Eigenvectors: These are numbers and vectors associated with a matrix which are used to understand linear transformations described by the matrix. Eigenvalues tell about the magnitude of the transformation, while eigenvectors tell about the direction. They are extensively used in machine learning algorithms, especially in dimensionality reduction techniques like PCA.
Special Matrix Operations:
Trace: The trace of a square matrix is the sum of the elements on the main diagonal (from the top left to the bottom right) of the matrix.
Matrix Factorization (Decomposition): This is the breaking down or deconstructing of a matrix into its constituent parts to make certain operations simpler. An example of this is the LU decomposition or the QR decomposition.
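As a rough sketch (again assuming NumPy), most of these quantities can be computed directly; the specific matrix used here is only an illustration.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # a symmetric matrix: A equals its transpose

print(np.allclose(A, A.T))          # True -> symmetric
print(np.eye(2))                    # the 2x2 identity matrix
print(np.trace(A))                  # trace: 2 + 2 = 4.0
print(np.linalg.det(A))             # determinant: 2*2 - 1*1 = 3.0
print(np.linalg.matrix_rank(A))     # rank: 2 (the rows are linearly independent)

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                  # the eigenvalues 3 and 1 (order may vary)
print(eigenvectors)                 # columns are the corresponding eigenvectors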
These concepts are crucial in understanding the underlying mathematical computations performed in machine learning algorithms.
4. Vector Spaces
A vector space (also known as a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars in this context. Here are the main topics you should understand:
Definition of a Vector Space: The notion of a vector space relies on two fundamental operations: vector addition and scalar multiplication. If these operations satisfy eight axioms (associativity and commutativity of addition, identity and inverse elements for addition, compatibility of scalar multiplication with field multiplication, identity element of scalar multiplication, and the two distributive properties), we say the set of vectors forms a vector space.
Subspaces: Subspaces are a subset of a vector space that still satisfies the properties of vector spaces. They need to include the zero vector, and be closed under vector addition and scalar multiplication.
Linear Combinations and Span: A linear combination of some vectors is an equation that's made up of summing those vectors, each multiplied by a corresponding scalar. The span of a set of vectors is the set of all possible linear combinations of the vectors.
Basis and Dimension: A basis of a vector space is a set of linearly independent vectors that span the whole vector space. The dimension of a vector space is the number of vectors in its basis. For example, in R², a basis could be two vectors at right angles, and so its dimension is 2.
Linear Independence and Dependence: A set of vectors is linearly independent if no vector in the set can be defined as a linear combination of the others. If a vector can be defined as a linear combination of others, then they are linearly dependent.
Orthogonality and Orthonormality: Two vectors are orthogonal if their dot product is zero. A set of vectors is orthonormal if all vectors in the set are orthogonal to each other and each of unit length.
Linear Transformations: These are functions between two vector spaces that preserve the operations of vector addition and scalar multiplication.
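Here is a minimal sketch of how some of these ideas can be checked numerically with NumPy; the vectors chosen are purely illustrative.

import numpy as np

# Linear independence: stack the vectors as rows and compare the rank to the number of vectors.
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
w = np.array([1.0, 1.0, 0.0])       # w = u + v, so {u, v, w} is linearly dependent

print(np.linalg.matrix_rank(np.vstack([u, v])))      # 2 -> u and v are independent
print(np.linalg.matrix_rank(np.vstack([u, v, w])))   # 2 < 3 -> the set is dependent

# A linear combination: 3*u - 2*v lies in span{u, v}.
print(3 * u - 2 * v)                                 # [ 3. -2.  0.]

# Orthonormality: dot products of distinct vectors are 0 and each vector has unit length.
print(np.dot(u, v), np.linalg.norm(u), np.linalg.norm(v))   # 0.0 1.0 1.0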
Understanding the concept of vector spaces is fundamental to many machine learning algorithms, especially those that use geometric or topological properties of data such as k-Nearest Neighbors, Support Vector Machines, and Principal Component Analysis.
5. Norms and Distance Metrics
These concepts play a crucial role in machine learning, especially in clustering and nearest neighbors algorithms.
Norms: A norm on a vector space is a function from vectors to non-negative values that behaves in certain ways like the absolute value function. A norm provides a notion of distance from the origin, magnitude, or length in the vector space.
L1 Norm (Manhattan Distance): The L1 norm of a vector is the sum of the absolute values of its elements. It is often used when the distinction between zero and small non-zero values matters, for example when encouraging sparsity in a model's weights.
L2 Norm (Euclidean Distance): The L2 norm of a vector is the square root of the sum of the squares of its elements. It measures the vector's Euclidean distance from the origin, which is why it is also known as the Euclidean norm.
Distance Metrics: These are functions that define a distance between pairs of points. They are used in many machine learning algorithms to compute the similarity between instances.
Euclidean Distance: This is perhaps the most commonly used distance metric. It is defined as the square root of the sum of the squared differences of the coordinates.
Manhattan Distance: This is computed as the sum of the absolute differences of their coordinates. It is called Manhattan distance because it's similar to how you might navigate on a grid-based city like Manhattan.
Cosine Similarity: This is a measure of similarity between two vectors, calculated as the dot product of the vectors divided by the product of their magnitudes. It is particularly useful in positive spaces, where the outcome is neatly bounded in [0, 1].
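A small sketch (NumPy assumed) showing how these norms and metrics are computed; the vectors are arbitrary examples.

import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([4.0, 0.0, 2.0])

print(np.linalg.norm(x, ord=1))          # L1 norm: |1| + |2| + |2| = 5.0
print(np.linalg.norm(x, ord=2))          # L2 norm: sqrt(1 + 4 + 4) = 3.0

print(np.linalg.norm(x - y, ord=2))      # Euclidean distance between x and y
print(np.linalg.norm(x - y, ord=1))      # Manhattan distance between x and y

# Cosine similarity: dot product divided by the product of the magnitudes.
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)                           # 8 / (3 * sqrt(20)) ≈ 0.596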
Remember, different types of problems and data may require different norms or distance metrics. A crucial part of applying machine learning effectively is understanding which of these tools is appropriate for a given situation.
6. Linear Transformations and Matrices
Linear transformations are a fundamental part of linear algebra, and they're intimately related to systems of linear equations. A transformation is just a function that takes an input and produces an output, and a linear transformation is one that has two additional properties:
Additivity: T(u + v) = T(u) + T(v) for any vectors u and v in the vector space.
Scalar multiplication: T(cv) = cT(v) for any vector v in the vector space and any scalar c.
In essence, these properties mean that a linear transformation is a transformation that preserves the operations of vector addition and scalar multiplication.
Now, what's really powerful about linear transformations is that every linear transformation can be represented by a matrix, and the action of applying the transformation to a vector can be represented by multiplying the matrix with the vector.
Therefore, a matrix can be viewed as a way of representing a linear transformation. This leads to the concept of Matrix Transformations, where you learn to represent any linear transformation in terms of matrix multiplication.
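As a rough illustration (assuming NumPy), the sketch below treats a 2x2 matrix as a linear transformation and numerically checks the two defining properties on example vectors.

import numpy as np

# A rotation by 90 degrees, represented as a matrix.
T = np.array([[0.0, -1.0],
              [1.0,  0.0]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
c = 2.5

# Applying the transformation is just matrix-vector multiplication.
print(T @ u)                                     # [-2.  1.]

# Additivity: T(u + v) == T(u) + T(v)
print(np.allclose(T @ (u + v), T @ u + T @ v))   # True

# Scalar multiplication: T(c*v) == c*T(v)
print(np.allclose(T @ (c * v), c * (T @ v)))     # True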
In the context of machine learning, linear transformations are used frequently in both data pre-processing (like PCA for dimensionality reduction) and within machine learning models themselves (like the rotations and scaling within a neural network).
To master linear transformations, you should understand the concepts of eigenvectors and eigenvalues, as they give insights about the transformation, such as the directions in which the transformation occurs and the magnitude of the transformation.
Keep in mind that this topic is a core concept of linear algebra and is crucial for understanding the underlying operations of many machine learning algorithms.
7. Eigenvalues and Eigenvectors
Eigenvalues and Eigenvectors are fundamental in the field of linear algebra and play pivotal roles in many machine learning algorithms, especially in dimensionality reduction techniques like PCA.
Eigenvectors: An eigenvector of a square matrix A is a non-zero vector v such that when A is multiplied by v, the result is a scalar multiple of v. In other words, the direction of v is unchanged by the transformation A. This can be written as Av = λv, where λ is a scalar known as the eigenvalue corresponding to this eigenvector.
Eigenvalues: These are the scalars λ associated with each eigenvector. They indicate the scalar factor by which the corresponding eigenvector is stretched or squished.
Eigenbasis: If a vector space has a basis that consists entirely of eigenvectors, it's known as an Eigenbasis. The matrix representation of a linear transformation in an Eigenbasis has a special form: it is a diagonal matrix, where each entry on the diagonal is the eigenvalue corresponding to the eigenvector in the basis.
Eigendecomposition: This is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. If a matrix A has n linearly independent eigenvectors, then A can be written as PDP⁻¹, where P is the matrix whose columns are the eigenvectors of A, D is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, and P⁻¹ is the inverse of P.
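A minimal NumPy sketch of this relationship, verifying Av = λv and the decomposition A = PDP⁻¹ for a small example matrix.

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, P = np.linalg.eig(A)      # columns of P are the eigenvectors of A
D = np.diag(eigenvalues)               # diagonal matrix of the eigenvalues

# Check Av = λv for the first eigenpair.
v = P[:, 0]
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))     # True

# Check the eigendecomposition A = P D P^-1.
print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True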
Understanding Eigenvalues and Eigenvectors is crucial in machine learning. They are used in many machine learning algorithms, such as PCA for dimensionality reduction, spectral clustering, and understanding the convergence properties of different machine learning algorithms, among others.
8. Singular Value Decomposition (SVD)
Singular Value Decomposition is an important concept and a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items.
Singular Value Decomposition (SVD): SVD is a method of decomposing a matrix into three other matrices: A matrix A of size (m x n) can be decomposed into UΣVᵀ where:
U (an m x m matrix) and V (an n x n matrix) are orthogonal matrices. The columns of U are the left singular vectors and the columns of V are the right singular vectors of A.
Σ (an m x n matrix) is a rectangular diagonal matrix whose diagonal entries are the singular values of A; they are non-negative and conventionally listed in descending order.
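A minimal sketch with NumPy (note that np.linalg.svd returns Vᵀ directly and the singular values as a 1D array rather than the full Σ matrix); the example matrix is arbitrary.

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])        # a 2 x 3 matrix

U, s, Vt = np.linalg.svd(A)             # s holds the singular values, Vt is V transposed
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)              # rebuild the rectangular diagonal Σ

print(U.shape, Sigma.shape, Vt.shape)   # (2, 2) (2, 3) (3, 3)
print(np.allclose(A, U @ Sigma @ Vt))   # True: A = U Σ Vᵀ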
Applications of SVD: The utility of the SVD in data science and machine learning is multifold:
Dimensionality Reduction: In this context, SVD is used to reduce the number of features to a smaller set of uncorrelated components, while retaining most of the information of the original features. This is most notably performed using Principal Component Analysis (PCA), which uses SVD under the hood.
Latent Semantic Analysis (LSA): In natural language processing, LSA uses SVD to identify relationships between terms and topics in text.
Image Compression: SVD can be used to approximate an image using fewer bits of information than the original image.
Collaborative Filtering: SVD is used in collaborative filtering to predict people's item ratings in recommendation systems.
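To give a flavour of how the compression and dimensionality-reduction applications work, the sketch below keeps only the top k singular values to build a low-rank approximation of a matrix. The random matrix stands in for an image or data matrix and is purely hypothetical.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))      # stand-in for an image or a data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10                                  # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense,
# and it needs far fewer numbers to store: k*(100 + 50 + 1) vs 100*50.
print(A_k.shape)                                    # (100, 50)
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))  # relative approximation error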
Remember, understanding the intuition and application of SVD is key to understanding many advanced machine learning algorithms. It is a fundamental concept in linear algebra and provides a way to calculate and work with fewer dimensions, thus reducing noise and speeding up computations.
9. Principal Component Analysis (PCA)
Principal Component Analysis, or PCA, is a statistical procedure used for dimensionality reduction in data. It simplifies the complexity in high-dimensional data while retaining trends and patterns.
Here's a detailed explanation!
PCA Procedure: The procedure involves identifying the direction in the multi-dimensional space along which the data varies the most. In other words, it finds the principal components of the data. The first principal component captures the most variance in the data. Then, PCA identifies other components orthogonal to the first component that account for the remaining variance in the data. Each subsequent component accounts for less variance.
Step by Step Process:
Standardize the Data: It's important to standardize the data (mean = 0, standard deviation = 1) to ensure that the scale of the variables doesn't affect the principal components.
Covariance Matrix Computation: PCA uses the covariance between the variables to identify directions in which the data varies the most.
Compute the Eigenvalues and Eigenvectors: The eigenvectors of the covariance matrix represent the principal components (directions of maximum variance), while the corresponding eigenvalues will give the amount of variance carried by each Principal Component.
Sort Eigenvalues and Corresponding Eigenvectors: After calculating the eigenvalues and vectors, sort them in descending order. The eigenvector with the highest corresponding eigenvalue is the first principal component.
Transformed Data: Once the principal components are derived, you can transform the original data onto these components to get the new set of variables.
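The sketch below implements these steps directly with NumPy on a hypothetical data matrix X (rows = samples, columns = features); in practice a library routine such as scikit-learn's PCA would usually be used instead.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                 # 200 samples, 5 features (illustrative data)

# 1. Standardize the data: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)                 # shape (5, 5)

# 3. Eigenvalues and eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# 4. Sort the eigenpairs in descending order of eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the top k principal components.
k = 2
X_pca = X_std @ eigenvectors[:, :k]               # transformed data, shape (200, 2)

print(X_pca.shape)
print(eigenvalues / eigenvalues.sum())            # fraction of variance carried by each component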
Advantages of PCA:
Reduces Overfitting: Reducing the data dimensions can help minimize the risk of overfitting.
Improves Algorithm Performance: By decreasing the data's dimensionality, we can speed up the learning algorithm.
Data Visualization: It allows us to visualize high-dimensional data in a 2D or 3D space.
Limitations of PCA:
Principal Components Are Less Interpretable: The transformed variables produced by PCA are linear combinations of the original features, so they are not as readable or interpretable as the original variables.
Data Standardization is Crucial: If the data is not standardized, then PCA might not work properly.
Not Suitable for Handling Outliers: PCA is sensitive to outliers.
In the context of machine learning, PCA is typically used as an initial preprocessing step before applying a machine learning algorithm, helping to improve performance by reducing feature space and mitigating issues related to the curse of dimensionality.