We have a couple of systems with high thread counts, namely 20- and 24-core SMT=8 ppc64 systems. We can vary the number of active cores/threads on these machines on demand, from 1 to 192 threads. For example:
During recent testing in our 32-bit PowerPC environments (this does not affect 64-bit environments), we discovered significant instability (transient test failures) at N~120, as well as reliably repeatable fatal errors at N~160, both of which are mitigated when lower thread configurations are used. Our highest-thread-count x86_64 machine (running an i586 environment) caps out at 72 and is stable.
All tests are run with the exact same binaries and hardware; the only thing that changes is how many cores/threads are active.
Linux adelie 5.15.132-mc6-easy #1 SMP Sun Nov 12 04:16:43 UTC 2023 ppc GNU/Linux
Current behavior
It is not immediately clear to us whether this is an Erlang/OTP issue or an Elixir issue; however, we are filing a report with Elixir first.
On 32-bit systems with high thread counts, instability at N~120 and repeatable test failures at N~160 are being observed.
Transient instability at or above N~120 might look like this:
1) test undefined global no_warn_undefined :all (Module.Types.IntegrationTest)
test/elixir/module/types/integration_test.exs:562
Assertion with == failed
code: assert capture_compile_warnings(files) == ""
left: " warning: redefining module A (current version defined in memory)\n │\n 1 │ defmodule A do\n │ ~~~~~~~~~~~~~~\n │\n └─ a.ex:1: A (module)\n\n"
right: ""
stacktrace:
test/elixir/module/types/integration_test.exs:576: (test)
or it might involve timeouts:
==> mix (ex_unit)
Running ExUnit with seed: 406740, max_cases: 224
Excluding tags: [windows: true]
...
1) test runs in daemon mode (Mix.Tasks.ReleaseTest)
/usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/test/mix/tasks/release_test.exs:716
** (ExUnit.TimeoutError) test timed out after 60000ms. You can change the timeout:
1. per test by setting "@tag timeout: x" (accepts :infinity)
2. per test module by setting "@moduletag timeout: x" (accepts :infinity)
3. globally via "ExUnit.start(timeout: x)" configuration
4. by running "mix test --timeout x" which sets timeout
5. or by running "mix test --trace" which sets timeout to infinity
(useful when using IEx.pry/0)
where "x" is the timeout given as integer in milliseconds (defaults to 60_000).
stacktrace:
(elixir 1.17.2) lib/process.ex:303: Process.sleep/1
test/mix/tasks/release_test.exs:807: Mix.Tasks.ReleaseTest.wait_until/1
test/mix/tasks/release_test.exs:743: anonymous fn/1 in Mix.Tasks.ReleaseTest."test runs in daemon mode"/1
(mix 1.17.2) lib/mix/project.ex:463: Mix.Project.in_project/4
(elixir 1.17.2) lib/file.ex:1665: File.cd!/2
test/test_helper.exs:156: MixTest.Case.in_fixture/3
test/mix/tasks/release_test.exs:717: (test)
(ex_unit 1.17.2) lib/ex_unit/runner.ex:485: ExUnit.Runner.exec_test/2
(stdlib 6.0) timer.erl:590: :timer.tc/2
(ex_unit 1.17.2) lib/ex_unit/runner.ex:407: anonymous fn/6 in ExUnit.Runner.spawn_test_monitor/4
warning: redefining module ReleaseTest.MixProject (current version defined in memory)
│
1 │ defmodule ReleaseTest.MixProject do
│ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
└─ /usr/src/packages/user/elixir/src/elixir-1.17.2/lib/mix/tmp/Mix.Tasks.ReleaseTest/runs_eval_and_version_commands/mix.exs:1: ReleaseTest.MixProject (module)
However, at N~160, the reliably reproducible errors might look like this:
==> mix (ex_unit)
Running ExUnit with seed: 410624, max_cases: 256
Excluding tags: [windows: true]
... warning: redefining module Mix.Tasks.Local.Sample (current version defined in memory)
│
1 │ defmodule Mix.Tasks.Local.Sample do
│ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
└─ lib/local.sample.ex:1: Mix.Tasks.Local.Sample (module)
...
.......std_alloc: Cannot allocate 512 bytes of memory (of type "bpd").
Note that test performance (anecdotally) seems to slow down with higher thread counts, even if the tests are ultimately successful.
A quick calculation: 512 bytes × 128 threads = 65,536 bytes (2^16), and both 160 and 192 threads exceed that, which suggests N=128 might be near a 16-bit memory ceiling of some sort.
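The arithmetic behind that guess can be tabulated with a small sketch. It assumes, purely for illustration, that the 512-byte allocation from the `std_alloc` error scales once per active thread; that assumption, and the names `ALLOC_BYTES`, `CEILING`, `total_bytes`, and `crossover_threads`, are hypothetical and are not taken from Erlang/OTP.

```python
# Illustrative arithmetic only. The 512-byte figure comes from the std_alloc
# error message; treating it as a per-thread cost is an assumption.
ALLOC_BYTES = 512
CEILING = 2 ** 16  # 65,536 bytes, the suspected 16-bit ceiling


def total_bytes(threads: int) -> int:
    """Total bytes if each active thread needed one 512-byte allocation."""
    return ALLOC_BYTES * threads


def crossover_threads() -> int:
    """Thread count at which the total reaches the 16-bit ceiling."""
    return CEILING // ALLOC_BYTES


if __name__ == "__main__":
    # Thread counts mentioned in the report, plus the 128 crossover.
    for n in (72, 120, 128, 160, 192):
        total = total_bytes(n)
        status = "at or above" if total >= CEILING else "below"
        print(f"N={n:3d}: {total:6d} bytes ({status} 2^16)")
```

Under this assumption the crossover sits at exactly N=128. N~120 would still fall slightly below it, so this arithmetic alone would not account for the transient failures seen there.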
Expected behavior
All tests pass.
We can offer access to this hardware if it will be helpful in debugging.
Allocation failures are definitely an Erlang issue. At the end of the day, Elixir only ships regular .beam modules. I recommend reporting an issue on Erlang/OTP.
You can also pick an Erlang project, such as rebar3, and try running their tests or the suites in OTP itself, in case you want to gain more confidence before reporting the bug there.
> You can also pick an Erlang project, such as rebar3, and try running their tests or the suites in OTP itself, in case you want to gain more confidence before reporting the bug there.
The OTP suite passes with N~192, so we'd need another way to reproduce such a failure and determine the root cause.
I pushed a commit to deal with the module already defined warning. :) I have subscribed and commented on the Erlang discussion. If anything new shows up pointing to Elixir, I can tackle it (or reopen it here).
Elixir and Erlang/OTP versions
Operating system
Adélie Linux 1.0-BETA5: