Frequent Segmentation Fault in Distributed setting #610
Unanswered
RefatIsmail96 asked this question in Q&A
Affects: PythonCall
Describe the bug
Hi, I am running a parallel computation on a cluster that uses Slurm. I use the SlurmClusterManager package to launch the Julia worker processes, and each process calls Python libraries (mainly Stim and mwpf) through PythonCall. I frequently get segmentation faults with very long error messages. I cannot reproduce the issue locally with the Distributed package; it only happens when I run the parallel computation on the cluster.
Here's part of the error message:
"srun: error: nid004513: task 236: Segmentation fault
srun: Terminating StepId=38285639.0
slurmstepd: error: *** STEP 38285639.0 ON nid004513 CANCELLED AT 2025-05-02T17:03:53 ***
srun: error: nid004513: tasks 10,26,46,78,142,184,199,225,244: Terminated
srun: error: nid004513: tasks 32,213: Terminated
srun: error: nid004513: tasks 9,11,14-15,25,31,40,143: Terminated
[954040] signal 11 (1): Segmentation fault
in expression starting at none:1
pymalloc_alloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:1544 [inlined]
_PyObject_Malloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:1564 [inlined]
PyObject_Malloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:801 [inlined]
PyLong_FromMedium at /usr/local/src/conda/python-3.12.10/Objects/longobject.c:210 [inlined]
PyLong_FromLong at /usr/local/src/conda/python-3.12.10/Objects/longobject.c:306
PyLong_FromLongLong at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/C/pointers.jl:303 [inlined]
pyint at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:719
Py at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:144 [inlined]
pytuple_setitem at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:897
unknown function (ip: 0x7f9cfef57dd3)
unknown function (ip: 0x7f9cfef57999)
unknown function (ip: 0x7f9cfef578d4)
macro expansion at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:0 [inlined]
pytuple_fromiter at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:923 [inlined]
#pycall#21 at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:242 [inlined]
pycall at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:233 [inlined]
##11 at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:357 [inlined]
Py at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:357 [inlined]
#98 at ./none:0
unknown function (ip: 0x7f9cfef577b2)
iterate at ./generator.jl:48 [inlined]
collect_to! at ./array.jl:849
collect_to_with_first! at ./array.jl:827 [inlined]
collect at ./array.jl:801
compile at /global/homes/r/rismail/.julia/dev/QuantumErrorCorrection/lib/QECDecoders/src/decoders/mwpf_decoder.jl:35
unknown function (ip: 0x7f9cfef571ba)
compile_decoders_on_all_workers at /global/u2/r/rismail/EarlyFT/src/decode/worker_fns.jl:344
unknown function (ip: 0x7f9cfef531d2)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430
jfptr_eval_28294.1 at /global/common/software/nersc9/julia/1.11.4/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:875
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
#invokelatest#2 at ./essentials.jl:1055
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
invokelatest at ./essentials.jl:1052
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
#114 at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:303
run_work_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70
unknown function (ip: 0x7f9cfef3badb)
run_work_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:79
#100 at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:88
unknown function (ip: 0x7f9cfef3b61f)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
start_task at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/task.c:1202
Allocations: 26958566 (Pool: 26957709; Big: 857); GC: 20"
I looked through the other issues about segmentation faults and could not find a similar one. I would appreciate any help narrowing down the possible causes. I have not been able to create an MWE yet because the error is non-deterministic.
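For context, here is a minimal sketch of the kind of setup described above. It is not the actual code and not a confirmed reproducer. The function `build_python_tuple` and the loop sizes are hypothetical stand-ins; the only part taken from the trace is that the crash is reported while PythonCall converts Julia integers into a Python tuple (`pytuple_fromiter` calling `pyint`). It assumes the script is run inside a Slurm allocation (`sbatch` or `salloc`).

```julia
using Distributed
using SlurmClusterManager

# Start one Julia worker per Slurm task; this only works inside a
# Slurm allocation.
addprocs(SlurmManager())

# Load PythonCall everywhere; each Julia process embeds its own
# Python interpreter.
@everywhere using PythonCall

# Hypothetical stand-in for the real decoder setup: building a Python
# tuple from Julia integers goes through pytuple_fromiter and pyint,
# the frames where the segfault is reported.
@everywhere function build_python_tuple(n::Int)
    t = pytuple(1:n)
    return pylen(t)  # return a plain Julia integer, not a Py object
end

# Run the stand-in many times across all workers.
lengths = pmap(_ -> build_python_tuple(1_000), 1:10_000)
@assert all(==(1_000), lengths)
```

Whether a loop like this ever triggers the crash is unknown; it only isolates the Julia-to-Python conversion path from the Stim and mwpf calls.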
Your system
Please provide detailed information about your system:
- CondaPkg: 0.2.24