Math in Data Science

Photo by Andrea Piacquadio on Pexels.com

Mathematics is called the “queen of the sciences.” It forms the foundation of many fields, including data science. Without math, there is no data science: the “science” part of data science is largely math. If you are a person like me who hates math, you still need to understand the parts of math that data science relies on. That raises some common questions: What are the math concepts used in data science? If I struggle with math, will data science be a good option for me? Let’s try to answer these questions.

The four main math concepts that form the pillars of data science are:

  1. Statistics and Probability
  2. Linear algebra
  3. Mathematical optimization
  4. Calculus (That doesn’t sound great I hope)

Calculus and linear algebra (nightmares) form the foundation for most of the sub-concepts. This blog post gives a quick overview of all four concepts and then dives into how a data scientist uses them day to day.

Statistics and Probability

This is the most important part of data science. Data science will be a bad option if you hate statistics. A strong knowledge of statistics and probability helps you understand how data is processed and interpret the results better. Starting with some basic definitions, Merriam-Webster defines statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” This definition shows the huge role statistics plays in data science.

Moving on to a basic definition of probability, Merriam-Webster defines it as the quality or state of being probable. In other words, it is the chance that an event occurs. Probability is used to measure uncertainty: it tells us how confident we can be in the conclusions we draw. Statistics helps us find the outcome, and probability measures the uncertainty behind it.
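To make "how confident you can be" a little more concrete, here is a minimal sketch that estimates a mean and a 95% confidence interval around it. The sample values and the function name `mean_confidence_interval` are made up for illustration, and the interval uses a normal approximation, which is only rough for a sample this small (a t-distribution would usually be preferred).

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Critical value of the standard normal (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err


# Hypothetical sample: page-load times in seconds.
load_times = [2.1, 1.9, 2.4, 2.2, 2.0, 2.3, 1.8, 2.5]
mean, low, high = mean_confidence_interval(load_times)
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The statistic (the mean) is the outcome; the width of the interval is the uncertainty that probability puts around it.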

Statistics and probability cannot be skimmed (don’t underestimate them just because they are not difficult to understand). A strong knowledge of these concepts is a must for a data scientist.

RESOURCES TO LEARN:-

  1. Statistics with R specialization by Coursera:- https://www.coursera.org/specializations/statistics
  2. Statistics and probability by Khan Academy:- https://www.khanacademy.org/math/statistics-probability

Linear Algebra

A data scientist must understand how a machine learning algorithm works, and hence needs to understand the math behind it. According to Merriam-Webster, linear algebra is “a branch of mathematics that is concerned with mathematical structures closed under the operations of addition and scalar multiplication and that includes the theory of systems of linear equations, matrices, determinants, vector spaces, and linear transformations.” In other words, it is algebra applied to multi-dimensional objects such as matrices and vectors. Linear algebra is not as important as statistics in data science, but it is important in machine learning. Why is it important in machine learning? Because the data used in machine learning algorithms has at least a two-dimensional structure (for example, rows of samples and columns of features). So you must have a basic understanding of vector and matrix operations to understand how machine learning algorithms work.

Do we need to learn linear algebra when most algorithms can be run with a single line of Python code? The answer is YES! Without knowing it, we cannot understand how the algorithms work. Knowing it also helps us apply the algorithms optimally and interpret their output.
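As a small illustration of what sits behind that "single line of Python code", the sketch below multiplies a data matrix by a weight vector, which is the core operation of a linear model's prediction. The data, the weights, and the helper name `mat_vec` are all hypothetical, and plain Python lists are used so nothing needs to be installed; in practice a library such as NumPy would do this.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical data matrix: 3 samples (rows) x 2 features (columns).
X = [
    [1.0, 3.0],
    [1.5, 4.0],
    [0.8, 2.0],
]
# Hypothetical model weights, one per feature.
w = [2.0, 0.5]

# Each prediction is the dot product of one row of X with w.
predictions = mat_vec(X, w)
print(predictions)
```

Every row is a vector, the whole dataset is a matrix, and the model's prediction step is a matrix-vector product.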

RESOURCES TO LEARN

Linear algebra courses by edX:- https://www.edx.org/learn/linear-algebra

Mathematical Optimization

Most readers will not be familiar with this topic unless they studied math after high school. What is mathematical optimization? Let’s look at its definition. Wikipedia states, “Mathematical optimization or mathematical programming is the selection of the best element, with regard to some criterion, from some set of available alternatives.” In other words, it is finding the optimal solution to a problem. For example, you may have encountered high school problems that ask you to “find the best possible outcome.” Where do we use this in data science? Like linear algebra, it comes into the picture in most machine learning algorithms, from the simplest to the most complicated. It is used in feature selection and in gradient descent, one of the most widely used optimization algorithms in machine learning.
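Here is a minimal sketch of gradient descent on a toy problem: finding the value of x that minimizes f(x) = (x - 3)^2. The function, the starting point, the learning rate, and the number of steps are all chosen for illustration; real models minimize a loss over many parameters, but the update rule is the same idea.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        # Move a small step in the direction that decreases the function.
        x -= learning_rate * grad(x)
    return x


def grad_f(x):
    # Derivative of the toy function f(x) = (x - 3)^2.
    return 2 * (x - 3)


minimum = gradient_descent(grad_f, start=0.0)
print(f"x at the minimum is approximately {minimum:.4f}")
```

The "best element" here is the x with the smallest f(x), and the derivative tells the algorithm which way to move at each step.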

RESOURCES TO LEARN

Discrete optimization by Coursera:- https://www.coursera.org/learn/discrete-optimization

Calculus

Calculus is involved in all the concepts seen earlier. You need not be a pro in calculus to become a data scientist. You can still be a good data scientist if you are strong in the other fundamentals.

Looking at the definition of calculus, Oxford states that it is “a branch of mathematics that deals with the finding and properties of derivatives and integrals of functions, by methods originally based on the summation of infinitesimal differences. The two main types are differential calculus and integral calculus.”

To follow the ML algorithms that involve calculus, all you need to know is derivatives, integrals, limits, and gradients.
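The link between limits, derivatives, and gradients can be sketched numerically. The snippet below approximates a derivative with a small step h (a stand-in for the limit as h goes to 0) and builds a gradient from one such approximation per input. The function names and the test functions are hypothetical examples, not part of any library.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient(f, point, h=1e-5):
    """Approximate the gradient of f at a point (a list of coordinates)."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        # Partial derivative with respect to coordinate i.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def square(x):
    return x ** 2


def bowl(p):
    return p[0] ** 2 + p[1] ** 2


slope = derivative(square, 3.0)   # analytically, d/dx x^2 = 2x
g = gradient(bowl, [1.0, 2.0])    # analytically, the gradient is [2x, 2y]
print(slope, g)
```

A gradient is simply the list of derivatives with respect to each input, which is the quantity gradient descent follows.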

RESOURCES TO LEARN

Calculus courses by edX:- https://www.edx.org/learn/calculus

Now let’s move on to how these fields are applied in a data science role:

Experimental design – applied as A/B testing.

Data processing – for model preparation and analysis.

Modelling – building / optimization / evaluation.

Experimental Design

Starting with experimental design, concepts from statistics are used on a day-to-day basis. Experimental design generally means setting up procedures or experiments to test a hypothesis. A/B testing compares two variants (for example, two models) to see which yields a better outcome. This can be done using statistical techniques such as

  1. Sampling techniques
  2. Identification of bias in data
  3. Hypothesis testing
  4. Confidence intervals.
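As a rough sketch of how hypothesis testing shows up in an A/B test, the snippet below runs a two-proportion z-test on conversion counts from two variants. The counts and the function name `two_proportion_z_test` are invented for illustration, and the test assumes large, independent samples; it is one common choice, not the only way to analyze an A/B test.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B are the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a difference at least this large if A and B were equal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 1000 users per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z={z:.2f}, p-value={p_value:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone, provided the sampling was unbiased.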

Data processing

Here we need statistics to manipulate the data before sending it into a machine learning algorithm. Some algorithms require data in a certain format, so we need to process it first. Data processing involves,

  1. Detection of outliers (stat)
  2. Normalization of the data (stat)
  3. Standardization (stat)
  4. Dimensionality reduction (linear algebra).
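The first three items in this list can be sketched with Python's built-in statistics module. The data and function names below are made up for illustration, and the 1.5 x IQR rule is just one common convention for flagging outliers.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the first/third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical feature column: customer ages with one suspicious entry.
ages = [22, 25, 27, 30, 31, 33, 35, 90]

normalized = min_max_normalize(ages)
standardized = standardize(ages)
outliers = iqr_outliers(ages)
print(outliers)
```

Normalization and standardization both rescale a feature; which one is appropriate depends on what the downstream algorithm expects.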

We will look much deeper into data processing techniques in my upcoming blog posts.

Modelling

Moving on to modelling, this is where linear algebra, optimization, and calculus come into the picture. We need all of these concepts to understand what is going on under the hood of the algorithms so that we can optimize and implement them better.

Modelling techniques involve,

  1. Regularization (stat)
  2. Variance and bias in the model (stat)
  3. Linear/logistic regression (stat, linear algebra, optimization, calculus)
  4. Random forest (probability)
  5. Naive Bayes (probability), clustering (linear algebra)
  6. Deep learning
  7. Mean square error (stat)
  8. r^2 and adjusted r^2 (stat)
  9. ROC-AUC (stat)
  10. Log-loss (probability)
  11. Benchmarking (probability).
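Several of the evaluation items in this list are short formulas. Below is a minimal sketch of mean square error, r^2, and log-loss written from their standard definitions; the function names and all the numbers are hypothetical, and in practice a library such as scikit-learn provides tested versions of these metrics.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def log_loss(y_true, prob_pred, eps=1e-15):
    """Binary log-loss; y_true holds 0/1 labels, prob_pred holds P(label=1)."""
    total = 0.0
    for t, p in zip(y_true, prob_pred):
        # Clip probabilities so log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical regression results.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)

# Hypothetical classification results.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]
ll = log_loss(labels, probs)
print(mse, r2, ll)
```

Lower is better for mean square error and log-loss, while r^2 closer to 1 means the predictions explain more of the variation in the data.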

Nothing is tough if you have the interest and willpower to learn. Being a good data scientist means understanding what goes on under the hood of the algorithms. I have shared some resources where you can learn these concepts. I hope this blog was helpful in understanding the math involved in data science.

Happy learning and stay tuned for more blogs on data science.
