Too many threads triggers instability on 32-bit systems / std_alloc: Cannot allocate 512 bytes of memory (of type "bpd") #13774

Closed
zv-io opened this issue Aug 9, 2024 · 4 comments

Comments

@zv-io

zv-io commented Aug 9, 2024

Elixir and Erlang/OTP versions

We have a couple of systems with high thread counts, namely 20- and 24-core SMT=8 ppc64 systems. We can vary the number of active cores/threads on these machines on demand, from 1 to 192 threads. For example:

# ppc64_cpu --cores-on=24 && ppc64_cpu --smt=8 && nproc
192

# ppc64_cpu --cores-on=20 && ppc64_cpu --smt=8 && nproc
160

# ppc64_cpu --cores-on=19 && ppc64_cpu --smt=7 && nproc
133

During recent testing in our 32-bit PowerPC environments (64-bit environments are not affected), we observed significant instability (transient test failures) at N~120 threads and reliably reproducible fatal errors at N~160 threads. Both are mitigated by using lower thread configurations. Our highest-thread-count x86_64 machine (running an i586 environment) tops out at 72 threads and is stable.

All tests are run with the exact same binaries and hardware; the only thing that changes is how many cores/threads are active.

Some examples:

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:64:64] [ds:64:64:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:144:96] [ds:144:96:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:168:112] [ds:168:112:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)

Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:192:128] [ds:192:128:10] [async-threads:1]

Elixir 1.17.2 (compiled with Erlang/OTP 27)
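
For readers unfamiliar with the banner format: the `smp:A:B` field reports the number of scheduler threads configured and the number online. ExUnit documents its default `max_cases` as twice the number of online schedulers, which would be consistent with the `max_cases: 224` and `max_cases: 256` values in the logs later in this report (112 × 2 and 128 × 2). A minimal sketch that extracts these numbers from a banner line follows; the helper names `smp_counts` and `default_max_cases` are hypothetical and exist only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print "configured online" scheduler counts taken
# from the [smp:A:B] field of an Erlang banner line.
smp_counts() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[smp:\([0-9][0-9]*\):\([0-9][0-9]*\)].*/\1 \2/p'
}

# Hypothetical helper: ExUnit's documented default for max_cases is
# schedulers_online * 2, so derive it from the second smp number.
default_max_cases() {
    online=$(smp_counts "$1" | cut -d' ' -f2)
    echo $(( online * 2 ))
}

banner='Erlang/OTP 27 [erts-15.0] [source] [32-bit] [smp:192:128] [ds:192:128:10] [async-threads:1]'
smp_counts "$banner"          # configured and online schedulers
default_max_cases "$banner"   # ExUnit default concurrency for this banner
```

If the default concurrency turns out to matter, rerunning with a lower value (for example via `mix test --max-cases`) may help narrow things down, though that has not been tried in this report.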

Operating system

Adélie Linux 1.0-BETA5:

Linux adelie 5.15.132-mc6-easy #1 SMP Sun Nov 12 04:16:43 UTC 2023 ppc GNU/Linux

Current behavior

It is not immediately clear to us whether this is an Erlang/OTP issue or an Elixir issue; we are filing a report with Elixir first.

On 32-bit systems with high thread counts, we observe instability at N~120 and repeatable test failures at N~160.

Transient instability at or above N~120 might look like this:

  1) test undefined global no_warn_undefined :all (Module.Types.IntegrationTest)                                                        
     test/elixir/module/types/integration_test.exs:562                                                                                  
     Assertion with == failed                                       
     code:  assert capture_compile_warnings(files) == ""                                                                                                                                                                                                                        
     left:  "    warning: redefining module A (current version defined in memory)\n    │\n  1 │ defmodule A do\n    │ ~~~~~~~~~~~~~~\n    │\n    └─ a.ex:1: A (module)\n\n"                                                                                                     
     right: ""                                                      
     stacktrace:                                                                                                                                                                                                                                                                
       test/elixir/module/types/integration_test.exs:576: (test)

or it might involve timeouts:

==> mix (ex_unit)                                                   
Running ExUnit with seed: 406740, max_cases: 224                                                                                                                                                                                                                                
Excluding tags: [windows: true]                                     
                                                                                                                                        
...                                                                    
  1) test runs in daemon mode (Mix.Tasks.ReleaseTest)               
     /usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/test/mix/tasks/release_test.exs:716                                        
     ** (ExUnit.TimeoutError) test timed out after 60000ms. You can change the timeout:                                                                                                                                                                                         

       1. per test by setting "@tag timeout: x" (accepts :infinity)
       2. per test module by setting "@moduletag timeout: x" (accepts :infinity)
       3. globally via "ExUnit.start(timeout: x)" configuration     
       4. by running "mix test --timeout x" which sets timeout      
       5. or by running "mix test --trace" which sets timeout to infinity                                                               
          (useful when using IEx.pry/0)                             

     where "x" is the timeout given as integer in milliseconds (defaults to 60_000).                                                    
                                                                    
     stacktrace:                                                    
       (elixir 1.17.2) lib/process.ex:303: Process.sleep/1          
       test/mix/tasks/release_test.exs:807: Mix.Tasks.ReleaseTest.wait_until/1                                                          
       test/mix/tasks/release_test.exs:743: anonymous fn/1 in Mix.Tasks.ReleaseTest."test runs in daemon mode"/1                        
       (mix 1.17.2) lib/mix/project.ex:463: Mix.Project.in_project/4                                                                    
       (elixir 1.17.2) lib/file.ex:1665: File.cd!/2                 
       test/test_helper.exs:156: MixTest.Case.in_fixture/3          
       test/mix/tasks/release_test.exs:717: (test)                  
       (ex_unit 1.17.2) lib/ex_unit/runner.ex:485: ExUnit.Runner.exec_test/2                                                            
       (stdlib 6.0) timer.erl:590: :timer.tc/2                      
       (ex_unit 1.17.2) lib/ex_unit/runner.ex:407: anonymous fn/6 in ExUnit.Runner.spawn_test_monitor/4                                 

    warning: redefining module ReleaseTest.MixProject (current version defined in memory)                                               
    │                                                               
  1 │ defmodule ReleaseTest.MixProject do                           
    │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                           
    │                                                               
    └─ /usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/tmp/Mix.Tasks.ReleaseTest/runs_eval_and_version_commands/mix.exs:1: ReleaseTest.MixProject (module)

However, at N~160 the reliably reproducible errors look like this:

==> mix (ex_unit)
Running ExUnit with seed: 410624, max_cases: 256
Excluding tags: [windows: true]

... warning: redefining module Mix.Tasks.Local.Sample (current version defined in memory)
    │
  1 │ defmodule Mix.Tasks.Local.Sample do
    │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    │
    └─ lib/local.sample.ex:1: Mix.Tasks.Local.Sample (module)
...

.......std_alloc: Cannot allocate 512 bytes of memory (of type "bpd").

Note that test performance (anecdotally) seems to slow down with higher thread counts, even if the tests are ultimately successful.

A quick calculation suggests a possible explanation: 512 bytes × 128 threads is exactly 65,536 bytes (2^16), and both 160 and 192 threads exceed that, so N=128 might be near a 16-bit memory ceiling of some sort.
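
The arithmetic behind this guess can be checked directly. The sketch below assumes, purely for illustration, that each thread accounts for one 512-byte block; that model comes from the speculation above and is not a confirmed property of the Erlang runtime. The helper name `bytes_for_threads` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: total bytes if each of N threads held one
# 512-byte block (an assumed model, not confirmed runtime behavior).
bytes_for_threads() {
    echo $(( 512 * $1 ))
}

limit=$(( 1 << 16 ))   # 65536, the speculated 16-bit ceiling

# Compare the totals for a few thread counts against the ceiling.
for n in 64 128 160 192; do
    total=$(bytes_for_threads "$n")
    if [ "$total" -gt "$limit" ]; then
        echo "N=$n: $total bytes (exceeds $limit)"
    else
        echo "N=$n: $total bytes (within $limit)"
    fi
done
```

Under this model N=128 lands exactly on the ceiling, while N=160 and N=192 exceed it.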

Expected behavior

All tests pass.

We can offer access to this hardware if it will be helpful in debugging.

@zv-io
Author

zv-io commented Aug 9, 2024

@josevalim
Member

Allocation failures are definitely an Erlang issue. At the end of the day, Elixir only ships regular .beam modules. I recommend reporting an issue on Erlang/OTP.

You can also pick an Erlang project, such as rebar3, and try running their tests or the suites in OTP itself, in case you want to gain more confidence before reporting the bug there.

@zv-io
Author

zv-io commented Aug 9, 2024

> You can also pick an Erlang project, such as rebar3, and try running their tests or the suites in OTP itself, in case you want to gain more confidence before reporting the bug there.

The OTP suite passes with N~192, so we'd need another way to reproduce such a failure and determine the root cause.

I will report to Erlang/OTP shortly.

@josevalim
Member

I pushed a commit to deal with the "redefining module" warning. :) I have subscribed and commented on the Erlang discussion. If anything new shows up pointing to Elixir, I can tackle it (or reopen it here).

@josevalim closed this as not planned on Aug 10, 2024