runtime: SIGSEGV on nil pointer in mheap.freeManual #73628

Open
kris-watts-gravwell opened this issue May 7, 2025 · 3 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@kris-watts-gravwell

kris-watts-gravwell commented May 7, 2025

Go version

1.23.8

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/kris/.cache/go-build'
GOENV='/home/kris/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/kris/mygo/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/kris/mygo'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/go/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/go/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.8'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/kris/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2339619411=/tmp/go-build -gno-record-gcc-switches'

What did you do?

We have been seeing this SIGSEGV happen sporadically across a large number of high-load services. It's pretty rare, so I have very low confidence that I will be able to zero in on a good repro.

We have probably 40+ instances of our backend service running across 10+ large AMD EPYC servers with lots of ECC RAM. The services are running in KVM on top of a fully updated Proxmox environment. Instances of the crashing backend service range from 4 vCPUs and 16GB of RAM to 16 vCPUs and 128GB of RAM.

We saw the crash start happening when we transitioned to the 1.23 runtime, and it has happened on most of our physical hosts. None of the hosts are reporting memory problems, and prior to deployment we ran a memtest and a stress test on all of them. The service typically runs for over a month before we see this SIGSEGV, and we never see it on our low-load instances, only on high-load ones (more GC activity, so that isn't unexpected).

The backing service makes extensive use of mmap for large disk backed data structures.

We just moved our builds over to the 1.24.X runtime, but we haven't run them in production for long enough to see if the crash goes away. I can see that the mheap code is wildly different in 1.24 vs 1.23.

What did you see happen?

Crash with the following backtrace (happens in exactly the same spot every time):

SIGSEGV: segmentation violation
PC=0x42a5bc m=16 sigcode=1 addr=0x64

goroutine 0 gp=0xc000704540 m=16 mp=0xc000081c08 [idle]:
runtime.(*mheap).freeManual(0x29a1940, 0x0, 0x2)
        runtime/mheap.go:1605 +0xbc fp=0xc000c8ffa0 sp=0xc000c8ff70 pc=0x42a5bc
runtime.(*sweepLocked).sweep.func2()
        runtime/mgcsweep.go:826 +0x70 fp=0xc000c8ffc8 sp=0xc000c8ffa0 pc=0x4273b0
runtime.systemstack(0x3240c903240c903)
        runtime/asm_amd64.s:514 +0x4a fp=0xc000c8ffd8 sp=0xc000c8ffc8 pc=0x4778ea

Digging in, it's pretty clear we are getting a nil pointer from runtime.spanOf, which then causes the crash when runtime.freeManual attempts to assign s.needzero = 1.

Across all our crashes, the arguments to runtime.freeManual are always (0x0, 0x2) according to the backtraces.
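
For anyone reading the trace: a store to a struct field through a nil pointer faults at an address equal to that field's offset, which is consistent with the small addr=0x64 above. The sketch below is only an illustration of that mechanism. fakeSpan, needzeroOffset and storeThroughNil are hypothetical names, and the padding is picked so the field sits at 0x64; it is not the real runtime.mspan layout.

```go
package main

import (
	"fmt"
	"unsafe"
)

// fakeSpan is a hypothetical stand-in for a span structure. The padding is
// an assumption chosen so that needzero lands at offset 0x64, matching the
// addr=0x64 in the report. It does not mirror runtime.mspan.
type fakeSpan struct {
	_        [0x64]byte
	needzero uint8
}

// needzeroOffset reports where a store to s.needzero lands relative to s.
// When s is nil, this offset is also the absolute faulting address.
func needzeroOffset() uintptr {
	return unsafe.Offsetof(fakeSpan{}.needzero)
}

// storeThroughNil performs the same kind of store on a nil pointer and
// reports whether it panicked. In ordinary goroutine code the fault surfaces
// as a recoverable panic; in the reported crash it happened inside the
// runtime on goroutine 0 and took the process down instead.
func storeThroughNil() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	var s *fakeSpan
	s.needzero = 1 // nil pointer store at offset 0x64
	return false
}

func main() {
	fmt.Printf("offset of needzero: %#x\n", needzeroOffset())
	fmt.Println("nil store panicked:", storeThroughNil())
}
```

This only shows why a nil span pointer would produce a fault at a small, fixed address. It says nothing about why spanOf returned nil in the first place.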

What did you expect to see?

No SIGSEGV

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 7, 2025
@randall77
Contributor

That value 2 for the spanAllocType should no longer happen at all in 1.24, so upgrading might very well fix things for you.

I don't see how spanOf could return nil for a non-nil s.largeType. Very strange. I think without a reproducer it will be very hard to track this down.

@kris-watts-gravwell
Author

That's what I was worried about. Across our 40ish workers, which have processed literally 100s of petabytes, we have seen it about 12 times.

Knowing that 1.24 shouldn't see this at all is good enough for us, but I figured I should post it in case anyone had ideas and because 1.23 is still maintained.
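
If it helps anyone else tracking a rollout like this, here is a minimal sketch of a startup check that reports whether a binary is still running on a 1.23 runtime. It relies only on runtime.Version(); goMinor and builtWith123 are hypothetical helper names. Treating 1.23 as the release to flag is an assumption based on this thread, not a confirmed diagnosis.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// goMinor extracts the minor release number from a version string such as
// "go1.23.8". It returns ok=false for strings it does not recognize, for
// example development builds.
func goMinor(version string) (minor int, ok bool) {
	rest, found := strings.CutPrefix(version, "go1.")
	if !found {
		return 0, false
	}
	// Keep only the leading digits, so "24rc1" and "23.8" both parse.
	end := 0
	for end < len(rest) && rest[end] >= '0' && rest[end] <= '9' {
		end++
	}
	n, err := strconv.Atoi(rest[:end])
	if err != nil {
		return 0, false
	}
	return n, true
}

// builtWith123 reports whether the version string names a Go 1.23 release.
func builtWith123(version string) bool {
	minor, ok := goMinor(version)
	return ok && minor == 23
}

func main() {
	v := runtime.Version()
	fmt.Printf("runtime %s, built with 1.23: %v\n", v, builtWith123(v))
}
```

Logging this once at startup makes it easy to tell from crash reports which runtime a given worker was actually on.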

@cherrymui cherrymui added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 8, 2025
@gopherbot
Contributor

Change https://go.dev/cl/671096 mentions this issue: runtime: remove ptr/scalar bitmap metric

gopherbot pushed a commit that referenced this issue May 8, 2025
We don't use this mechanism any more, so the metric will always be zero.
Since CL 616255.

Update #73628

Change-Id: Ic179927a8bc24e6291876c218d88e8848b057c2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/671096
Reviewed-by: Keith Randall
Reviewed-by: Michael Knyszek
Auto-Submit: Keith Randall
LUCI-TryBot-Result: Go LUCI