Skip to content

maths/gaussian.py is wrong #2212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stkain opened this issue Jul 18, 2020 · 9 comments
Closed

maths/gaussian.py is wrong #2212

stkain opened this issue Jul 18, 2020 · 9 comments

Comments

@stkain
Copy link

stkain commented Jul 18, 2020

Hi,

the formula is wrong.

return 1 / sqrt(2 * pi * sigma ** 2) * exp(-((x - mu) ** 2) / 2 * sigma ** 2)
                                                              ^^^^^^^^^^^^^^^
                                                              this actually multiplies with sigma**2 instead of dividing by sigma**2

Instead, for example:

1 / (sqrt(2 * pi) * sigma) * exp(- 0.5 * ( (x - mu) / sigma     )**2 )

avoids squaring sigma and taking the sqrt again in the factor before the exponential

and first calculating (x-mu) / sigma and then squaring saves you squaring numerator and denominator separately
in the exponential.

Bye,
Stefan

@cclauss
Copy link
Member

cclauss commented Jul 18, 2020

@QuantumNovice Could we create a doctest with one of these to prove this claim:
https://docs.scipy.org/doc/scipy/reference/search.html?q=Gaussian

@spamegg1
Copy link
Contributor

spamegg1 commented Jul 25, 2020

Defining it as the OP did, gives (in Python 3.8.2 shell) exactly the same results as in the docstring of maths/gaussian.py:

>>> from numpy import exp, pi, sqrt
>>> def gaussian(x, mu: float = 0.0, sigma: float = 1.0) -> int:
	return 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)

>>> gaussian(1)
0.24197072451914337
>>> gaussian(24)
3.342714441794458e-126
>>> import numpy as np
>>> x = np.arange(15)
>>> gaussian(x)
array([3.98942280e-01, 2.41970725e-01, 5.39909665e-02, 4.43184841e-03,
       1.33830226e-04, 1.48671951e-06, 6.07588285e-09, 9.13472041e-12,
       5.05227108e-15, 1.02797736e-18, 7.69459863e-23, 2.11881925e-27,
       2.14638374e-32, 7.99882776e-38, 1.09660656e-43])
>>> gaussian(15)
5.530709549844416e-50
>>> gaussian([1,2, 'string'])
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    gaussian([1,2, 'string'])
  File "<pyshell#7>", line 2, in gaussian
    return 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
TypeError: unsupported operand type(s) for -: 'list' and 'float'
>>> gaussian('hello world')
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    gaussian('hello world')
  File "<pyshell#7>", line 2, in gaussian
    return 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
TypeError: unsupported operand type(s) for -: 'str' and 'float'
>>> gaussian(10**234)
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    gaussian(10**234)
  File "<pyshell#7>", line 2, in gaussian
    return 1 / (sqrt(2 * pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
OverflowError: (34, 'Numerical result out of range')
>>> gaussian(10**-326)
0.3989422804014327
>>> gaussian(2523, mu=234234, sigma=3425)
0.0

None of the search results linked by @cclauss calculates the same function.

I was able to find an implementation from the authors of the famous Algorithms textbooks (Sedgewick et al):
https://introcs.cs.princeton.edu/python/22module/gaussian.py.html

This also gives exactly the same results as the docstring of maths/gaussian.py (with the exception of supporting NumPy arrays of course, and the Overflow error):

>>> import math
>>> def pdf(x, mu=0.0, sigma=1.0):
    x = float(x - mu) / sigma
    return math.exp(-x*x/2.0) / math.sqrt(2.0*math.pi) / sigma

>>> pdf(1)
0.24197072451914337
>>> pdf(24)
3.342714441794458e-126
>>> pdf(15)
5.530709549844416e-50
>>> pdf([1,2, 'string'])
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    pdf([1,2, 'string'])
  File "<pyshell#28>", line 2, in pdf
    x = float(x - mu) / sigma
TypeError: unsupported operand type(s) for -: 'list' and 'float'
>>> pdf('hello world')
Traceback (most recent call last):
  File "<pyshell#33>", line 1, in <module>
    pdf('hello world')
  File "<pyshell#28>", line 2, in pdf
    x = float(x - mu) / sigma
TypeError: unsupported operand type(s) for -: 'str' and 'float'
>>> pdf(10**234)
0.0
>>> pdf(10**-326)
0.3989422804014327
>>> pdf(2523, mu=234234, sigma=3425)
0.0

So... maths/gaussian.py seems correct, but the OP might be right about the performance (unnecessary squaring and square-rooting).

@cclauss
Copy link
Member

cclauss commented Jul 25, 2020

Thanks for the explanation. Should we close? @stkain, your sense?

@stkain
Copy link
Author

stkain commented Jul 26, 2020

Hello folks,

@spamegg1 's calls gave the same results because sigma was never specified (defaulting to 1.0)

Try something like:

import numpy as np 
import matplotlib.pyplot as plt 

def wrong_gaussian(x, mu, sigma): # wrong gaussian from math/gaussian.py
    return 1 / sqrt(2 * pi * sigma ** 2) * exp(-((x - mu) ** 2) / 2 * sigma ** 2) 

def gaussian(x, mu, sigma): # propper gaussian
    return 1 / (sqrt(2 * pi) * sigma) * exp(- 0.5 * ( (x - mu) / sigma     )**2 ) 

x = np.linspace(-5, 5, 1024) 
plt.figure(num=1, figsize=(10, 5)) 
plt.clf() 
plt.plot(x, gaussian(x, 0, 1), 'k', label="normal distribution with sigma=1.0") 
plt.plot(x, gaussian(x, 0, 2), 'b', label="proper gaussian with sigma=2.0") 
plt.plot(x, wrong_gaussian(x, 0, 2), 'r', label="wrong gaussian with sigma=2.0") 
plt.legend() 
plt.xlabel('x') 
plt.ylabel('pdf') 
plt.grid()      

in ipython or jupyter
and you will notice the difference.
Larger sigma should make the bell curve flatter and broader, not narrower.
That is why sigma must be in the denominator, not the numerator.

Bye,
Stefan

@stkain
Copy link
Author

stkain commented Jul 26, 2020

Correction.
The snippets above only work with ipython --pylab (my default :-) )

The prefixes for numpy and pyplot have to be explicitly specified.
So instead of the above, try this:

import numpy as np 
import matplotlib.pyplot as plt 

def wrong_gaussian(x, mu, sigma): # wrong gaussian from math/gaussian.py
    return 1 / np.sqrt(2 * np.pi * sigma ** 2) * np.exp(-((x - mu) ** 2) / 2 * sigma ** 2) 

def gaussian(x, mu, sigma): # propper gaussian
    return 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(- 0.5 * ( (x - mu) / sigma     )**2 ) 

x = np.linspace(-5, 5, 1024) 
plt.figure(num=1, figsize=(10, 5)) 
plt.clf() 
plt.plot(x, gaussian(x, 0, 1), 'k', label="normal distribution with sigma=1.0") 
plt.plot(x, gaussian(x, 0, 2), 'b', label="proper gaussian with sigma=2.0") 
plt.plot(x, wrong_gaussian(x, 0, 2), 'r', label="wrong gaussian with sigma=2.0") 
plt.legend() 
plt.xlabel('x') 
plt.ylabel('pdf') 
plt.grid()

@stkain
Copy link
Author

stkain commented Jul 26, 2020

The correct formula for the gaussian:
propper_gauss

The formula, as implemented in math/gaussian.py:
wrong_gauss

Please notice the difference in the exponential:
the sigma squared in the denominator of the proper definition versus the sigma squared as a multipying factor in the wrong definition.

@spamegg1
Copy link
Contributor

spamegg1 commented Jul 26, 2020

Hmm OK. In that case the docstring provides insufficient testing. It should specify some different sigma values. The only case with a specified sigma value returns 0.

Let me compare maths/gaussian.py against Sedgewick's implementation again:

>>> import math
>>> def pdf(x, mu=0.0, sigma=1.0):
    x = float(x - mu) / sigma
    return math.exp(-x*x/2.0) / math.sqrt(2.0*math.pi) / sigma

>>> from numpy import exp, pi, sqrt
>>> def gaussian(x, mu: float = 0.0, sigma: float = 1.0) -> int:
	return 1 / sqrt(2 * pi * sigma ** 2) * exp(-((x - mu) ** 2) / 2 * sigma ** 2)

>>> pdf(1, 2, 3)
0.12579440923099774
>>> gaussian(1, 2, 3)
0.001477282803979336

They are indeed different!

Adding parentheses around 2 * sigma ** 2 fixes it:

>>> def gaussian(x, mu: float = 0.0, sigma: float = 1.0) -> int:
	return 1 / sqrt(2 * pi * sigma ** 2) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

>>> gaussian(1, 2, 3)
0.12579440923099774

Strangely, for larger values they are the same (I guess the difference is further down the decimals):

>>> pdf(10, mu=10, sigma=10)
0.03989422804014327
>>> gaussian(10, mu=10, sigma=10)
0.03989422804014327
>>> pdf(10, mu=10, sigma=100)
0.003989422804014327
>>> gaussian(10, mu=10, sigma=100)
0.003989422804014327

OK, so I propose that the implementation in maths/gaussian.py should just add parentheses around the 2 * sigma ** 2 (then it gives the correct answer), and add some docstring cases with small sigma values that are different than 1.

If the OP would like to make a PR that is fine. @cclauss You can close it once someone makes a PR.

@spamegg1
Copy link
Contributor

spamegg1 commented Aug 1, 2020

#2249 was merged in, should this be closed now?

@stkain
Copy link
Author

stkain commented Aug 6, 2020

Hi,

thx. Yes, the message can be closed.

Bye,
Stefan

@stkain stkain closed this as completed Aug 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants