[inequality] Incorporate a new exercise on vectorizing the gini_coefficient function #410

Closed
mmcky opened this issue Mar 25, 2024 · 3 comments
@mmcky
Contributor

mmcky commented Mar 25, 2024

We can write a new exercise in the inequality lecture to teach the difference between Python loops and vectorization.

Here is a starting point for the exercise.

```{exercise}
:label: inequality_ex3

The {ref}`code to compute the Gini coefficient is listed in the lecture above <code:gini-coefficient>`.

This code uses loops to calculate the coefficient based on income or wealth data.

This function can be rewritten using vectorization, which greatly improves computational efficiency in Python.

Rewrite the function `gini_coefficient` using `numpy` and vectorized code.

You can compare the output of this new function with the one above and note the difference in speed.
```

```{solution-start} inequality_ex3
:class: dropdown
```

Let's take a look at some raw data for the US that is stored in `df_income_wealth`:

```{code-cell} ipython3
df_income_wealth.describe()
```

```{code-cell} ipython3
df_income_wealth.head(n=4)
```

We will focus on the wealth variable `n_wealth` to compute a Gini coefficient for the year 2016.

```{code-cell} ipython3
data = df_income_wealth[df_income_wealth.year == 2016]
```

```{code-cell} ipython3
data.head(n=2)
```

We can first compute the Gini coefficient using the function defined in the lecture above.

```{code-cell} ipython3
gini_coefficient(data.n_wealth.values)
```

Now we can write a vectorized version using `numpy`:

```{code-cell} ipython3
def gini(y):
    # Pairwise absolute differences via broadcasting: (n, 1) - (1, n) -> (n, n)
    n = len(y)
    y_1 = np.reshape(y, (n, 1))
    y_2 = np.reshape(y, (1, n))
    g_sum = np.sum(np.abs(y_1 - y_2))
    return g_sum / (2 * n * np.sum(y))
```

```{code-cell} ipython3
gini(data.n_wealth.values)
```
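To compare the two versions without relying on `df_income_wealth`, here is a minimal sketch on simulated data. The loop-based `gini_loop` is an assumption: it is a stand-in written from the same formula as `gini`, since the lecture's `gini_coefficient` is not reproduced in this issue. The sample size and seed are arbitrary.

```python
import time
import numpy as np

def gini_loop(y):
    # Hypothetical loop version: sum |y_i - y_j| over all ordered pairs
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += abs(y[i] - y[j])
    return total / (2 * n * np.sum(y))

def gini(y):
    # Vectorized version from the cell above
    n = len(y)
    y_1 = np.reshape(y, (n, 1))
    y_2 = np.reshape(y, (1, n))
    return np.sum(np.abs(y_1 - y_2)) / (2 * n * np.sum(y))

rng = np.random.default_rng(0)        # arbitrary seed
y = np.exp(rng.standard_normal(500))  # small lognormal sample

t0 = time.perf_counter()
g_loop = gini_loop(y)
t1 = time.perf_counter()
g_vec = gini(y)
t2 = time.perf_counter()

print(f"loop:       {g_loop:.6f} in {t1 - t0:.4f}s")
print(f"vectorized: {g_vec:.6f} in {t2 - t1:.4f}s")
```

Both functions should agree up to floating-point error; the timings depend on the machine, so none are quoted here.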

However, this uses a long-run time series, so it would be better to migrate this to simulated data, whose size we can control and which we can generate in the lecture.
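One thing to keep in mind on data size: the broadcast in `gini` builds an `n × n` array, so memory grows quadratically with the number of observations. If that becomes a problem, one possible alternative (not part of the proposal above, just a sketch) is the rank-based form of the same statistic, which only needs a sort. The name `gini_sorted` is hypothetical.

```python
import numpy as np

def gini_sorted(y):
    # Rank-based identity, with y_(1) <= ... <= y_(n):
    #   G = 2 * sum_i i * y_(i) / (n * sum_i y_i) - (n + 1) / n
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    ranks = np.arange(1, n + 1)
    return 2 * np.sum(ranks * y) / (n * np.sum(y)) - (n + 1) / n

# Check against the pairwise (broadcast) formula on a small sample
rng = np.random.default_rng(0)  # arbitrary seed
y = np.exp(rng.standard_normal(1_000))
n = len(y)
pairwise = np.abs(y[:, None] - y[None, :]).sum() / (2 * n * y.sum())

print(gini_sorted(y), pairwise)
```

This trades the `(n, n)` intermediate array for a single sort, at the cost of being a less direct translation of the double loop.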

@longye-tian
Collaborator

longye-tian commented Jun 21, 2024

Hi Matt @mmcky ,

Maybe we can add the following paragraph to illustrate this vectorized function using simulated data?

Let's simulate five populations by drawing from a lognormal distribution as before:

```{code-cell} ipython3
k = 5
σ_vals = np.linspace(0.2, 4, k)
n = 2_000
σ_vals = σ_vals.reshape((k,1))
μ_vals = -σ_vals**2/2
y_vals = np.exp(μ_vals + σ_vals*np.random.randn(n))
```
We can compute the Gini coefficient for these five populations using the vectorized function as follows:

```{code-cell} ipython3
gini_coefficients = []
for i in range(k):
    # y_vals has shape (k, n); row i holds the i-th simulated population
    gini_coefficients.append(gini(y_vals[i]))
```

This gives us the Gini coefficients for these five populations.

```{code-cell} ipython3
gini_coefficients
```
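As a sanity check on the simulated values, the population Gini coefficient of a lognormal distribution with shape parameter σ is erf(σ/2), so the sample estimates can be compared with it. The sketch below repeats the simulation with a seeded generator (the seed and the use of `np.random.default_rng` are assumptions made for reproducibility) and prints both.

```python
import math
import numpy as np

def gini(y):
    # Vectorized Gini coefficient from the solution above
    n = len(y)
    y_1 = np.reshape(y, (n, 1))
    y_2 = np.reshape(y, (1, n))
    return np.sum(np.abs(y_1 - y_2)) / (2 * n * np.sum(y))

rng = np.random.default_rng(1234)  # arbitrary seed, for reproducibility only
k, n = 5, 2_000
σ_vals = np.linspace(0.2, 4, k).reshape((k, 1))
μ_vals = -σ_vals**2 / 2
y_vals = np.exp(μ_vals + σ_vals * rng.standard_normal(n))  # shape (k, n)

sample_gini = [gini(y_vals[i]) for i in range(k)]
theory_gini = [math.erf(σ / 2) for σ in σ_vals.ravel()]

for σ, g_s, g_t in zip(σ_vals.ravel(), sample_gini, theory_gini):
    print(f"σ = {σ:.2f}: sample {g_s:.3f}, population {g_t:.3f}")
```

For small σ the two should be close. For large σ the sample value can differ noticeably from the population value at this sample size, because the distribution is heavy-tailed.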

Best,
Longye

@mmcky
Contributor Author

mmcky commented Jul 1, 2024

Thanks @longye-tian. If you can prepare a PR, that sounds great. We can work on this together in that branch.

We can add this as an exercise to this lecture.

longye-tian added a commit that referenced this issue Jul 2, 2024
Hi Matt @mmcky ,

I have updated exercise 3 of the inequality lecture using your code in #410 and added the simulation part below your solution.

What do you think about this version of the solution?

Best,
Longye
mmcky added a commit that referenced this issue Jul 5, 2024
* [inequality] Update exercise 3

* Update inequality.md

Hi Matt,
I have updated the solution and the main text by adding `%%time`.

What do you think about this comparison?

* Update inequality.md

add labels to the main text gini coefficient code.

* Update inequality.md

* add data.ipynb and delete to csv

Hi Matt,

I have added the data.ipynb to the folder and I think it contains sufficient code to save the data.

I have also modified the content to deal with the saving and calling issues related to the csv.

What do you think about these changes?

Best,
Longye

* remove skip-execution code as it is not compatible with Google Colab

* test the problem

This commit is to test whether the problem is due to this code.

* Revert "test the problem"

This reverts commit 395657e.

* test google colab RAM

this commit is to test whether the crash is led by the

* change link to notebook on github

* update_inequality_exercise

Hi Matt,

This commit selects a random sample of 3000 observations from the original dataset.

Best,
Longye

* update year in the text

update year in the text

---------

Co-authored-by: mmcky <[email protected]>
@longye-tian
Collaborator

Hi Matt,
I think this issue is closed by pull request #498.

Best,
Longye
