Descriptive Statistics: Mean, Median and Mode using Python

Descriptive statistics describe the basic features of a data set. They simplify and summarize large amounts of data in a meaningful way. Descriptive statistics do not, however, allow us to draw conclusions beyond the data we have analyzed; they are simply a way to describe data.

Typically, there are two ways to describe data statistically:

Measures of central tendency: These describe the central position of a frequency distribution, using the mean, median, and mode.

Measures of spread: These summarize a group of data by describing how spread out the scores are. As an example, the mean score of students in a class may be 70, but not all students will have scored 70; some will have scored less and some more. Measures of spread summarize how spread out these scores are. They are mainly described by the range, quartiles, absolute deviation, variance, and standard deviation.
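As a rough illustration of the measures of spread listed above, here is a minimal sketch using Python's standard `statistics` module. The five scores are hypothetical, chosen only so that the mean comes out to 70.

```python
import statistics

# Hypothetical scores for a class of five students (illustrative only).
scores = [60, 65, 70, 75, 80]

mean_score = statistics.mean(scores)        # central value: 70
score_range = max(scores) - min(scores)     # range: largest minus smallest
variance = statistics.variance(scores)      # sample variance
std_dev = statistics.stdev(scores)          # sample standard deviation

print(mean_score, score_range, variance, round(std_dev, 2))
```

Note that `statistics.variance` and `statistics.stdev` compute the sample versions (dividing by n - 1); `pvariance` and `pstdev` give the population versions.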

This post covers the measures of central tendency and their implementation in Python.

Mean

The mean, also known as the average, is calculated by taking the sum of all the values in a data set and dividing it by the number of values.
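The definition translates directly into plain Python. This short sketch uses the Subject1 marks from the sample data set in this post:

```python
# Subject1 marks from the post's sample data set.
subject1 = [48, 67, 52, 39, 58, 61, 55, 65, 49, 59]

# Mean = sum of all values divided by the number of values.
mean_subject1 = sum(subject1) / len(subject1)

print(mean_subject1)  # 55.3
```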

Median

The median, in simple terms, is the middle value of the data set. It is calculated by arranging the numbers in ascending order and selecting the middle element. If there is an odd number of elements, the middle element is the median. As an example, for a data set with 11 elements, the 6th element is the median, which divides the data set into two equal parts. For a data set with an even number of elements, however, the median is the mean of the two middle elements. So, for a data set with 10 elements, the median is the mean of the 5th and 6th elements.
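The odd and even cases described above can be sketched as a small helper. The function name `median` and the 11-element list are made up for illustration; the 10-element list is the Subject2 column from the sample data. The helper assumes a non-empty list.

```python
def median(values):
    """Return the middle value, averaging the two middle values when the count is even."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle element
        return ordered[mid]
    # even count: mean of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = [7, 1, 5, 3, 9, 11, 13, 2, 4, 6, 8]   # 11 elements (hypothetical)
even_example = [4, 3, 6, 3, 2, 2, 3, 5, 98, 91]     # 10 elements (Subject2)

print(median(odd_example))   # 6, the 6th element after sorting
print(median(even_example))  # 3.5, the mean of the 5th and 6th elements
```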

Mode

The mode is the most frequently occurring item in a data set, i.e. the value that appears most often.
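A data set can have more than one mode when several values tie for the highest count. The sketch below uses `statistics.multimode` from the standard library (available from Python 3.8, so newer than the 3.7 kernel recorded in the notebook) on the Subject3 grades from the sample data, plus a made-up list with a tie.

```python
import statistics
from collections import Counter

# Subject3 grades from the post's sample data set.
grades = ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'B', 'C', 'B']

modes = statistics.multimode(grades)     # list of all most frequent values
counts = Counter(grades)                 # frequency of each grade

print(modes)         # ['B']
print(counts['B'])   # 5

# Hypothetical data with a tie: both 1 and 2 occur twice.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print(tied_modes)    # [1, 2]
```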

Now that we have an idea about mean, median, and mode, let’s see how we can calculate these using Python. We will use the pandas and NumPy libraries and the stats module from SciPy to compute these values.

The next question is which measure of central tendency to use.

If the data is categorical (nominal or ordinal), use the mode.
If the data is quantitative, use the mean or the median.
If the data has outliers or is highly skewed, use the median rather than the mean.

As the sample code shows, Subject2 has outliers: students S9 and S10 (rows 8 and 9 of the DataFrame) scored significantly higher than the rest of the class. As a result, the mean for Subject2 is 21.7 while the median is 3.5, so in this case it makes sense to use the median rather than the mean.
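To see how strongly the two outliers pull the mean, the following sketch compares mean and median for Subject2 with and without them. It uses the standard `statistics` module rather than pandas, and the cutoff of 90 is an arbitrary choice made only to separate the two high scores in this particular data set.

```python
import statistics

# Subject2 marks from the post's sample data set.
subject2 = [4, 3, 6, 3, 2, 2, 3, 5, 98, 91]

mean_all = statistics.mean(subject2)       # 21.7, pulled up by 98 and 91
median_all = statistics.median(subject2)   # 3.5, barely affected

# Drop the two outliers (illustrative cutoff, not a general rule).
trimmed = [mark for mark in subject2 if mark < 90]

mean_trimmed = statistics.mean(trimmed)      # 3.5
median_trimmed = statistics.median(trimmed)  # 3.0

print(mean_all, median_all)
print(mean_trimmed, median_trimmed)
```

Removing two values moves the mean from 21.7 to 3.5, while the median only moves from 3.5 to 3.0, which is why the median is the more robust summary here.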

Percentile is another commonly used concept: it is the value below which a certain percentage of scores fall. As an example, if you scored 75 out of 80 and are at the 90th percentile, you performed better than 90% of the class.
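The example above can be sketched as a small function. The name `percentile_rank` and the class scores are hypothetical, and the function uses the "strictly below" definition from the paragraph above; other definitions (for example "at or below") give slightly different numbers.

```python
def percentile_rank(scores, score):
    """Percentage of scores that fall strictly below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical marks out of 80 for a class of ten students.
class_scores = [40, 45, 50, 55, 60, 62, 65, 70, 72, 75]

rank = percentile_rank(class_scores, 75)
print(rank)  # 90.0, since nine of the ten scores are below 75
```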

Quartiles divide the data set into four quarters, each containing 25% of the data. The first quartile (Q1) is the value such that 25% of the data points are less than it and 75% are greater. The second quartile (Q2) is the median: 50% of the values are less than it and 50% are greater. The third quartile (Q3) is the value such that 75% of the values are less than it and 25% are greater.
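As a sketch, the three quartiles of Subject1 can be computed with `statistics.quantiles` from the standard library (Python 3.8 or later). The `method="inclusive"` option is chosen here because it interpolates linearly between data points, which for this data gives the same values as the 25%, 50%, and 75% rows of the `describe()` output in the notebook.

```python
import statistics

# Subject1 marks from the post's sample data set.
subject1 = [48, 67, 52, 39, 58, 61, 55, 65, 49, 59]

# n=4 splits the data into four equal parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(subject1, n=4, method="inclusive")

print(q1, q2, q3)  # 49.75 56.5 60.5
```

Different interpolation methods give slightly different quartiles on small data sets, so results from other tools may not match exactly.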

{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Mean, Median & Mode using Python "
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Let's assume a data set of students and their marks in three subjects. The first step is to build a pandas DataFrame."
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\ndata = {\n        'Name'     : ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10'], \n        'Subject1' : [48, 67, 52, 39, 58, 61, 55, 65, 49, 59],\n        'Subject2' : [4, 3, 6, 3, 2, 2, 3, 5, 98, 91],\n        'Subject3' : ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'B', 'C', 'B'],\n} \n\ndf = pd.DataFrame(data)\ndf",
      "execution_count": 1,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Name</th>\n      <th>Subject1</th>\n      <th>Subject2</th>\n      <th>Subject3</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>S1</td>\n      <td>48</td>\n      <td>4</td>\n      <td>A</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>S2</td>\n      <td>67</td>\n      <td>3</td>\n      <td>B</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>S3</td>\n      <td>52</td>\n      <td>6</td>\n      <td>A</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>S4</td>\n      <td>39</td>\n      <td>3</td>\n      <td>C</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>S5</td>\n      <td>58</td>\n      <td>2</td>\n      <td>B</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>S6</td>\n      <td>61</td>\n      <td>2</td>\n      <td>B</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>S7</td>\n      <td>55</td>\n      <td>3</td>\n      <td>A</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>S8</td>\n      <td>65</td>\n      <td>5</td>\n      <td>B</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>S9</td>\n      <td>49</td>\n      <td>98</td>\n      <td>C</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>S10</td>\n      <td>59</td>\n      <td>91</td>\n      <td>B</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "  Name  Subject1  Subject2 Subject3\n0   S1        48         4        A\n1   S2        67         3        B\n2   S3        52         6        A\n3   S4        39         3        C\n4   S5        58         2        B\n5   S6        61         2        B\n6   S7        55         3        A\n7   S8        65         5        B\n8   S9        49        98        C\n9  S10        59        91        B"
          },
          "execution_count": 1,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "A pandas DataFrame has a describe() method that reports common statistics for its numeric columns, including the mean and the median (the 50% row)."
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "df.describe()",
      "execution_count": 2,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Subject1</th>\n      <th>Subject2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>10.000000</td>\n      <td>10.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>55.300000</td>\n      <td>21.700000</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>8.525126</td>\n      <td>38.424674</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>39.000000</td>\n      <td>2.000000</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>49.750000</td>\n      <td>3.000000</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>56.500000</td>\n      <td>3.500000</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>60.500000</td>\n      <td>5.750000</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>67.000000</td>\n      <td>98.000000</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "        Subject1   Subject2\ncount  10.000000  10.000000\nmean   55.300000  21.700000\nstd     8.525126  38.424674\nmin    39.000000   2.000000\n25%    49.750000   3.000000\n50%    56.500000   3.500000\n75%    60.500000   5.750000\nmax    67.000000  98.000000"
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "We can also use the functions provided by the NumPy library to find the mean and median. To do that, we first convert the DataFrame columns to NumPy arrays."
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "np_array = df[['Subject1', 'Subject2']].to_numpy()\nnp_array2 = df['Subject3'].to_numpy()\n\nprint(np_array)\nprint(np_array2)",
      "execution_count": 3,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "[[48  4]\n [67  3]\n [52  6]\n [39  3]\n [58  2]\n [61  2]\n [55  3]\n [65  5]\n [49 98]\n [59 91]]\n['A' 'B' 'A' 'C' 'B' 'B' 'A' 'B' 'C' 'B']\n"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "# axis=0 computes the statistic down the rows, giving one value per column\nnp.mean(np_array, axis=0)",
      "execution_count": 4,
      "outputs": [
        {
          "data": {
            "text/plain": "array([55.3, 21.7])"
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "np.median(np_array, axis=0)",
      "execution_count": 5,
      "outputs": [
        {
          "data": {
            "text/plain": "array([56.5,  3.5])"
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
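    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Calculate the mode using the SciPy stats module. Note that newer SciPy releases accept only numeric arrays in stats.mode; for string data such as Subject3, the pandas Series.mode() method is an alternative."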
    },
    {
      "metadata": {
        "trusted": true
      },
      "cell_type": "code",
      "source": "stats.mode(np_array2)",
      "execution_count": 8,
      "outputs": [
        {
          "data": {
            "text/plain": "ModeResult(mode=array(['B'], dtype=object), count=array([5]))"
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.3",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "",
      "data": {
        "description": "descriptive_stats.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
