Wednesday, September 18, 2024
Bug hunting in Go: The culprit is in the library

Despite Go’s growing popularity as an open-source programming language, it’s still possible to unearth a few juicy bugs in it. That’s what happened when Benoit Artuso, Software Architect at Scality, and his team were developing and benchmarking our software. Here’s how they exposed the bug, along with the takeaways from the sleuthing.

To start from the beginning: our code has to call another service over HTTP. This is what our pseudocode looks like:

function call_service(dataBuffer)
    result = http.Put(destination, dataBuffer)
    return result

dataBuffer = "some data"
result = call_service(dataBuffer)
print result
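
For readers who prefer real Go to pseudocode, here is a minimal sketch of the same call. It is our illustration, not the team’s code: net/http has no Put helper, so the request is built with http.NewRequest, and callService, putToEchoServer, and the local echo server are hypothetical stand-ins for the real service.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// callService PUTs data to destination and returns the status code and
// the response body.
func callService(client *http.Client, destination string, data []byte) (int, string, error) {
	req, err := http.NewRequest(http.MethodPut, destination, bytes.NewReader(data))
	if err != nil {
		return 0, "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

// putToEchoServer runs callService against a throwaway local server that
// echoes the request body back (a hypothetical stand-in for the real service).
func putToEchoServer(data string) (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(w, r.Body) // implicit 200 OK
	}))
	defer srv.Close()
	return callService(srv.Client(), srv.URL, []byte(data))
}

func main() {
	status, echoed, err := putToEchoServer("some data")
	fmt.Println(status, echoed, err)
}
```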

This is clean and simple enough. But code is code, and there’s often a catch. This time, our component needs to know when dataBuffer can be cleaned and reused.

So instead of the code above, we run this:

dataBuffer = get_an_existing_buffer("some data")
result = call_service(dataBuffer)
reuse(dataBuffer)
print result

This code works most of the time. But not all of the time: some tests still fail. Even worse, we discover that we can’t reuse the buffer right away when http.Put() fails.

So the detective work continues. We root around for the cause and devise a first fix. Instead of reusing the buffer right away, we attach a callback to it that notifies the code when the buffer is garbage collected. At this point, the bad actor appears to be behind us. Sure, memory levels in tests are slightly high, but not alarming.
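
The article doesn’t show this callback, but in Go the usual way to be notified of garbage collection is runtime.SetFinalizer. The sketch below assumes that mechanism; the dataBuffer type and the helper names are hypothetical. Note that a finalizer only runs once the collector has found the object unreachable, so the buffer cannot come back any sooner than the next GC cycle, and Go does not guarantee that finalizers run before the program exits.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// dataBuffer is a hypothetical stand-in for the reusable buffer.
type dataBuffer struct {
	data []byte
}

// allocateAndDrop creates a buffer, registers a finalizer on it, and then
// lets the only reference go out of scope.
func allocateAndDrop(collected chan<- struct{}) {
	buf := &dataBuffer{data: make([]byte, 1<<20)}
	runtime.SetFinalizer(buf, func(*dataBuffer) {
		// A real implementation would hand the memory back for reuse here.
		close(collected)
	})
}

// finalizerRuns reports whether the callback fired within a few seconds
// of an explicit garbage collection.
func finalizerRuns() bool {
	collected := make(chan struct{})
	allocateAndDrop(collected)
	runtime.GC() // force a collection so the finalizer gets queued
	select {
	case <-collected:
		return true
	case <-time.After(5 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("finalizer ran:", finalizerRuns())
}
```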

Fast-forward to the point where it’s time for large-scale tests on a real, high-performance platform. Under these conditions, memory usage skyrockets above acceptable levels, prompting us to postpone a few tests. Retracing our steps, we find that the previous workaround is what drives memory usage up.

Once again, we return to the drawing board for a different approach. This time, we have dataBuffer expose a close() method, which http.Put() calls (according to the documentation) once it no longer needs the buffer.

New code:

dataBuffer = get_an_existing_buffer("some data")
dataBuffer.close = function () { reuse(dataBuffer) }
result = call_service(dataBuffer)
print result

At this point, everything looks good and follows the documentation. Memory usage settles at acceptable levels. Problem solved? Hardly. There are suddenly a lot of errors in large-scale tests, more than before. Worse yet, the software seems allergic to this code and crashes early on. A look at the crash logs suggests that close() is getting called twice.
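
In Go terms, the close() hook means handing the HTTP client a request body that implements io.ReadCloser. The sketch below is an assumption about what such a wrapper could look like, not the team’s code: reusableBody and closeTwice are hypothetical names, and the sync.Once guard is our own addition to show one way of keeping the reuse callback from running twice when Close is called more than once.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
	"sync/atomic"
)

// reusableBody wraps a reader and runs onClose when the HTTP client
// closes the request body.
type reusableBody struct {
	io.Reader
	onClose func()
	once    sync.Once
	closes  int32 // number of Close calls, kept only for diagnostics
}

// Close counts every call but runs the reuse hook at most once.
func (b *reusableBody) Close() error {
	atomic.AddInt32(&b.closes, 1)
	b.once.Do(b.onClose)
	return nil
}

// closeTwice simulates a caller that closes the same body twice and
// reports how often the hook ran and how often Close was called.
func closeTwice() (reused int, closes int32) {
	body := &reusableBody{
		Reader:  bytes.NewReader([]byte("some data")),
		onClose: func() { reused++ },
	}
	var rc io.ReadCloser = body
	rc.Close()
	rc.Close()
	return reused, atomic.LoadInt32(&body.closes)
}

func main() {
	reused, closes := closeTwice()
	fmt.Printf("hook ran %d time(s) for %d Close call(s)\n", reused, closes)
}
```

Without the guard, the second Close would hand the same buffer back for reuse a second time, which is the kind of double release that can crash a program.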

While trying to pinpoint the source of those errors, we find that our change to dataBuffer’s behavior makes http.Put() create the request without a proper Content-Length. While the request is still valid (RFC-wise), the server refuses this kind of request and responds with a proper error code.
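
This matches documented net/http behavior: http.NewRequest fills in ContentLength only when the body is a *bytes.Buffer, *bytes.Reader, or *strings.Reader. Wrap the reader in a custom type and the length is left at 0, which the client treats as unknown for a non-nil body. The sketch below illustrates that and one possible remedy, setting ContentLength by hand. The hookedBody type and the URL are hypothetical, and we are assuming, not asserting, that this resembles the team’s fix.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// hookedBody is a hypothetical wrapper that adds a Close method and
// thereby hides the underlying *bytes.Reader from http.NewRequest.
type hookedBody struct{ io.Reader }

func (hookedBody) Close() error { return nil }

// contentLengths builds three PUT requests (none is sent) and returns the
// ContentLength seen with a plain reader, a wrapped reader, and a wrapped
// reader whose length was set explicitly.
func contentLengths(data []byte) (plain, wrapped, fixed int64, err error) {
	const url = "http://service.invalid/object" // placeholder address

	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return 0, 0, 0, err
	}
	plain = req.ContentLength

	req, err = http.NewRequest(http.MethodPut, url, hookedBody{bytes.NewReader(data)})
	if err != nil {
		return 0, 0, 0, err
	}
	wrapped = req.ContentLength // left at 0: length unknown to net/http

	req.ContentLength = int64(len(data)) // state the length explicitly
	fixed = req.ContentLength

	return plain, wrapped, fixed, nil
}

func main() {
	fmt.Println(contentLengths([]byte("some data")))
}
```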

This turns out to be an easy fix, and to our relief the crashing stops, too. Our suspicion:

if an error happens at a critical point, the Go standard library reacts by closing the buffer twice.

To narrow it down to the real culprit, we first reproduce the issue on a smaller scale, then reduce it until we’re left with a small, one-file reproducer that we use to open an issue on the golang/go repository.

Cue dramatic music: The bug is lodged in Go stdlib’s HTTP server!

Is that the end of the story? Nope. There’s no definitive solution yet. We’re hoping for a fix upstream, but if it doesn’t come through, we’ll consider a workaround in the code.

Stay tuned!

Takeaways from hunting bugs in Go

  • Go and its standard library are very good. Most of the issues we had came from us not doing “the right thing” while trying to push the envelope, and from not seeing what that implied for the input buffer lifecycle.
  • Every single word in the Go documentation matters. Don’t skip even a sentence!
  • Testing is your friend. Don’t ignore it.
  • Run large-scale and long-running tests as often as possible. We have exactly that in the pipeline.
  • Leave no stone unturned, and report what you find. It’s good for the Go ecosystem and, in the long term, benefits everyone.

