How I used pprof to investigate huge memory consumption issues in our Go Microservice

Shubham Ghadge
5 min read · Aug 16, 2021


Recently my team had an issue with one of our Go microservices deployed on a Kubernetes cluster. The service suddenly started to consume a lot of memory, and the logs were full of OOMKilled errors. When a pod reached 80–90% memory usage, our horizontal autoscaler scaled the service out, but any pod that hit its memory limit kept crashing. The autoscaler created new pods, yet those crashed after a while for the same reason, and this cycle repeated for a couple of days.

Our service runs jobs at intervals, and each job launches a pool of goroutines to do the work. We naturally put a limit on the number of these workers. We have multiple similar microservices that perform almost the same job, and they were all working fine; only this service was having the issue.
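For context, the worker pool is nothing fancy. Here is a minimal sketch of the pattern, assuming a bounded number of workers pulling items from a channel; this is illustrative only, not our actual job code:

```go
// Minimal sketch of a bounded worker pool, not the actual service code.
// The job type, item list, and worker count here are illustrative only.
package main

import (
	"fmt"
	"sync"
)

func processItem(id int) {
	// Placeholder for the real per-item work.
	fmt.Println("processed item", id)
}

func runJob(items []int, workers int) {
	jobs := make(chan int)
	var wg sync.WaitGroup

	// Cap concurrency at `workers` goroutines.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				processItem(item)
			}
		}()
	}

	for _, item := range items {
		jobs <- item
	}
	close(jobs)
	wg.Wait()
}

func main() {
	runJob([]int{1, 2, 3, 4, 5}, 2)
}
```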

To find the root cause, we analyzed the pods’ logs and resolved a few internal issues, but the memory usage remained high. We then shifted our focus from fixing code to memory utilization, thinking that more headroom might help: we doubled the memory from 512 MiB to 1 GiB, halved the worker pool, and increased the maximum number of replicas the autoscaler could deploy. Now we simply had a bunch of new pods that were crashing. The issue persisted.

The memory usage of pods looked like this on Grafana:

each colored line represents memory usage of a pod

pprof is a tool that helps you visualize and analyze runtime profiling data from Go programs. I had tried to use pprof about a year ago, but most of it went over my head, so I gave it another try.
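Our service already exposes the pprof endpoints over HTTP via net/http/pprof. If you have never set this up, it is usually just a blank import; the sketch below is illustrative rather than our actual server code (only the port matches the commands used later in this post):

```go
// Minimal sketch of exposing pprof endpoints over HTTP.
// Importing net/http/pprof for its side effect registers the
// /debug/pprof/* handlers on http.DefaultServeMux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	// Serve the profiling endpoints on :6060 (only reachable inside the cluster in our case).
	log.Println(http.ListenAndServe(":6060", nil))
}
```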

We waited a few days to gain access to the cluster, as the pprof endpoints were only exposed inside it. After getting access, I connected to the cluster by modifying my KUBECONFIG and then port-forwarded one of the pods using kubectl.
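The port-forward was a single command along these lines (the pod name here is a placeholder):

```
kubectl port-forward pod/my-service-abc123 6060:6060
```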

The issue was only with memory, so I was interested only in the heap profiling data. First, I collected a profile at a random moment using curl:

```
curl http://localhost:6060/debug/pprof/heap > heap.out
```

and visualized it using:

```
go tool pprof -http=:9090 heap.out
```

From here, I chose the inuse_space sample type, because I wanted to look only at the memory currently in use, and studied the graph. I spent almost the whole day understanding what exactly I was looking at. After reading multiple blog posts and watching a bunch of talks on YouTube, I understood the basics. The official Go blog post on profiling was the one I learned the most from: https://blog.golang.org/pprof.
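As a side note, you can also pick the sample type directly on the command line instead of in the web UI, for example:

```
go tool pprof -sample_index=inuse_space -http=:9090 heap.out
```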

My first mistake was to grab a single profile at a random time, and I did this multiple times. After understanding my mistake, I collected many profiles by hitting the endpoint in a loop every minute and timestamping the output files. Then I picked only the profiles whose timestamps matched the moments of highest memory usage, like 14:44 in the graph below.

memory usage of one of the pods which I was using to collect profiling data from
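The collection itself was just a loop around the curl command above, but the same idea can be sketched in Go; the endpoint and one-minute interval match what I used, everything else here is illustrative:

```go
// Minimal sketch of collecting timestamped heap profiles every minute.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func saveHeapProfile() error {
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Name each file after the time it was collected, e.g. heap-1444.out.
	name := fmt.Sprintf("heap-%s.out", time.Now().Format("1504"))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	for {
		if err := saveHeapProfile(); err != nil {
			log.Println("failed to collect profile:", err)
		}
		time.Sleep(time.Minute)
	}
}
```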

This time, after visualizing the profiles, I quickly found the issue.

The top results of inuse_space from the profile looked like this:

The nodes holding lots of memory looked like this in the pprof graph:

p.s. This profile is not the actual profile from which I found the issue.

I confirmed that the issue was large amounts of JSON marshaling and unmarshaling. When I followed the graph up to the top, one of the functions in our code was there, and the same function kept appearing across different profiles. The large JSON data was coming from an API that we were calling in that function. Ironically, the API fetches a list of pods from a cluster. When I checked a single record from the response, it was a whopping 15,000 lines of JSON! We were processing multiple records of almost this size at the same time, all in memory.

Also, we were using ioutil.ReadAll() to read the whole response into memory and then unmarshaling it into a map. After processing the JSON, we stored the whole document in MongoDB, and bson encoding relies heavily on reflection, which is probably why reflect.mapassign shows up in the graph.
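For illustration, the memory-heavy pattern looked roughly like the sketch below; the URL, types, and processing are placeholders, not our real code:

```go
// Sketch of the memory-heavy pattern: read the entire response into memory,
// then unmarshal it into a map. The raw bytes and the decoded map (plus any
// later bson encoding) all live on the heap at the same time.
package main

import (
	"encoding/json"
	"io/ioutil"
	"log"
	"net/http"
)

func fetchEverything(url string) (map[string]interface{}, error) {
	resp, err := http.Get(url) // placeholder call, not our real HTTP helper
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// The whole body is buffered in memory; huge for a 15,000-line JSON record.
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var doc map[string]interface{}
	if err := json.Unmarshal(body, &doc); err != nil {
		return nil, err
	}
	return doc, nil
}

func main() {
	doc, err := fetchEverything("http://example.com/pods") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	log.Println("decoded", len(doc), "top-level keys")
}
```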

To confirm that this was really the issue, we temporarily commented out that code and pushed the change to our development environment. And lo and behold, our pods were now using at most 300 MiB of memory!

Now, to solve this issue, I just had to deserialize the response in smaller chunks. The API did support a JSON stream, and I could have used json.NewDecoder, but our function for making HTTP calls, which we are more or less forced to use, lives in another package, and we did not want to rewrite separate logic for this API call. So I was stuck with ioutil.ReadAll. Thankfully, the API we were hitting also supported pagination, so I just wrote some code to fetch the response in batches. And voila, the issue was fixed!
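For illustration only, here is a sketch of fetching in batches, assuming a Kubernetes-style limit/continue pagination; the parameter names, page size, response shape, and processing are assumptions, not our actual client code:

```go
// Sketch of fetching a large list page by page instead of all at once.
// Assumes Kubernetes-style "limit" and "continue" query parameters.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"
)

// page is an assumed response shape: a slice of items plus a continue token.
type page struct {
	Items    []map[string]interface{} `json:"items"`
	Metadata struct {
		Continue string `json:"continue"`
	} `json:"metadata"`
}

func fetchInBatches(baseURL string, limit int) error {
	cont := ""
	for {
		url := fmt.Sprintf("%s?limit=%d", baseURL, limit)
		if cont != "" {
			url += "&continue=" + cont
		}

		resp, err := http.Get(url) // placeholder, not our shared HTTP helper
		if err != nil {
			return err
		}
		body, err := ioutil.ReadAll(resp.Body) // now only one small page in memory
		resp.Body.Close()
		if err != nil {
			return err
		}

		var p page
		if err := json.Unmarshal(body, &p); err != nil {
			return err
		}

		for _, item := range p.Items {
			_ = item // process one record at a time
		}

		if p.Metadata.Continue == "" {
			return nil // no more pages
		}
		cont = p.Metadata.Continue
	}
}

func main() {
	if err := fetchInBatches("http://example.com/pods", 50); err != nil {
		fmt.Println("error:", err)
	}
}
```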

After pushing the new changes, this is what the memory usage of the pods looked like in Grafana:

only 2 pods and both below 300 MiB!

What we first thought might be a memory leak thankfully turned out to be a minor issue in our code logic.

The pprof tool is awesome! There is so much to learn from it. Grafana, hooked up with Prometheus to collect these metrics, also played a major role in figuring this out. We could have used Jaeger to trace which part of the code was taking so long to execute, but unfortunately we had not instrumented this particular code with OpenTracing. Nevertheless, it was a valuable learning experience.

This is my first time writing something like this! Please do let me know if I got anything wrong.

Thanks for reading,

Shubham, a fellow Gopher.

References:
https://pkg.go.dev/net/http/pprof

https://www.youtube.com/watch?v=nok0aYiGiYA

https://www.youtube.com/watch?v=N3PWzBeLX2M

https://www.youtube.com/watch?v=ydWFpcoYraU
