Ok, that was really too dumb of me. But let us see how sockets, files, and all of that work, AND HOW things go south. I will basically walk you through my thinking process as i got into this error.
6:00 AM
I was called very early in the morning by the client. That is usually a bad sign. The client was complaining that the server was down. In reality it is almost never the case that "the server is down", but there is that. I took it from there and ssh'ed into the server. Strike 1: we don't have any monitoring system in place to alert us about incidents like this. If i make it to 7 strikes, i will confess to my crush – back to business.
First thing i do is access the system through the same channel the client did, just to see the same thing they saw. I checked the admin page and wow, it 404'ed. Something weird is going on.
Go to the server:
$ journalctl -u $process_name --since "1 hour ago"
This basically asks `systemd`'s journal for the `$process_name` logs from the past `$time` – linux is magic! And stdout for the win! Fuck loggers! Syslog! Ok ok i just got caught in the moment. Aaaand we got this:
2021/06/01 06:35:50 http: Accept error: accept tcp [::]:6662: accept4: too many open files;
That really does seem awful! Now let me, my dear friend, walk you through this message:
- Accept error: well, `accept` is the linux syscall that accepts incoming network connections! It is that clear <insert_michael_scotts_that_is_as_clear_as_i_can_make_it_meme_here>. Ok, it basically says: hey dude, i'm the linux, ooops the kernel i meant, the thing that does your tcp networking and other shit? you know me? Here is the thing: i'm so overwhelmed dude and i cannot accept this request. Byyyyye!
That is not super helpful, aight? Like why can't you accept connections? There are many reasons why `accept` won't accept things (there is a small sketch right after this list of where this error surfaces in a Go server):
- network card stuff (i honestly don’t know the specifics, but i can only imagine there would be some)
- file descriptors
- others that are not relevant to our discussion right now, and i also cannot recall them
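Just to make that error message concrete, here is a rough sketch (not our actual server code) of what Go's http server is doing underneath: listen on the port, then accept connections in a loop; when the process runs out of file descriptors, Accept fails with EMFILE, which is exactly the "too many open files" we saw in the log.

```go
// Minimal sketch of an accept loop hitting EMFILE. Illustrative only.
package main

import (
	"errors"
	"log"
	"net"
	"syscall"
)

func main() {
	ln, err := net.Listen("tcp", ":6662")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			// When the process has no free file descriptors left,
			// accept4 fails with EMFILE: "too many open files".
			if errors.Is(err, syscall.EMFILE) {
				log.Printf("Accept error: %v", err)
				continue
			}
			log.Fatal(err)
		}
		// A real server would hand conn off to a handler; we just close it.
		conn.Close()
	}
}
```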
We are going to discuss file descriptors only for the purpose of this post.
So what is a file descriptor or fd
What the heck is that! I will use my own understanding to describe it and then go to the lovely open(2) man page to get an informed opinion.
Any program is a process. And any process has a process id (so that the kernel can keep track of all running processes and better isolate and manage them). A file can be opened for reading from multiple processes. But it is never safe to have multiple processes writing to the same file, or fid, or fd – the terminology is quite confusing. And we all know that good 'ol practice of always closing the file after we finish writing to it.
So basically: file descriptors help the kernel map between an open file and the processes associated with it.
Now, to the formal, rather informed definition of file descriptors:
The open() system call opens the file specified by pathname. (...)
The return value of open() is a file descriptor, a small,
nonnegative integer that is used in subsequent system calls
(read(2), write(2), lseek(2), fcntl(2), etc.) to refer to the
open file. The file descriptor returned by a successful call
will be the lowest-numbered file descriptor not currently open
for the process.
(...)
A call to open() creates a new open file description, an entry in
the system-wide table of open files. (...) A
file descriptor is a reference to an open file description; this
reference is unaffected if pathname is subsequently removed or
modified to refer to a different file. (...)
I tried to only show the most relevant parts, but it still ended up being lengthy. To summarize that text: `open`, which is a syscall, returns a non-negative number when invoked, and that is what we call a file descriptor. The file descriptor remains unchanged and is resilient to the source file being renamed, removed, etc. – a snapshot taken at the invocation of `open` that only has the data at that time. Pretty dope if you ask me. And the last part: calling `open` creates a new entry for this particular fd in the system-wide table of open files!
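You can see both points from a tiny Go program. This is just a throwaway sketch (the file name is made up): `os.Open` returns an `*os.File` whose `Fd()` method exposes the underlying descriptor number, and that descriptor keeps working even after the file on disk is renamed.

```go
// Throwaway sketch: file descriptors are small integers, and they keep
// pointing at the open file description even if the path is renamed.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("data.txt") // hypothetical file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// 0, 1 and 2 are stdin, stdout and stderr, so the first file we open
	// usually gets the lowest free descriptor: 3.
	fmt.Println("fd:", f.Fd())

	// Rename the file on disk; the descriptor is unaffected and reads still work.
	if err := os.Rename("data.txt", "data_renamed.txt"); err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 16)
	n, _ := f.Read(buf)
	fmt.Printf("still readable after rename: %q\n", buf[:n])
}
```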
WOW! What have we learned! System. Wide. Table. Of. Open. Files. So the kernel keeps track of every currently open file. Now one thing i know about systems is that whenever you keep track of things, you impose limits! You cannot grow that "open files table" forever. That is just not how things work!
So, i know this command line tool called `ulimit` that shows – but, but, we will get back to it later.
Back to our error message again:
2021/06/01 06:35:50 http: Accept error: accept tcp [::]:6662: accept4: too many open files;
Now we know about accept. [::]:6662 means we accept any incoming request to port 6662 (the internal port we are using), on any interface. The error itself is accept4: too many open files.
Ok, we are getting somewhere with this. This error tells us that we have exceeded the maximum number of open files that the kernel allows us to have. Remember the table of all open files and that we said it has a cap? Well, you bet! We have exceeded that cap.
Now, the first thing we are gonna do is check how many file descriptors each process has open.
$ lsof | awk '{ print $2 " " $1; }' | sort -rn | uniq -c | sort -rn | head -10
So, this command is basically copied from SO, but to go through it quickly, here is what it does. `lsof` is literally short for list open files 😂! This is the one we need. `lsof` results look like this:
| COMMAND | PID | TID | USER | FD | TYPE | DEVICE SIZE/OFF | NODE NAME |
|---|---|---|---|---|---|---|---|
| init | 1 | | root | cwd | unknown | 0 | /proc/1/ |
We use `awk` to print the second column `PID`, a space, and then the first column `COMMAND`. We sort the output, count the unique (PID, COMMAND) pairs with `uniq -c`, sort again by that count, and keep only the top ten processes.
Unsurprisingly, our program is the one with the most open files! Panic! Or not that much panic, since we have already identified the problem.
let us fix that
I believe that there are three states in debugging: denial, anger, and acceptance.
So, we are leaking resources. We are not closing a file somewhere. Easy – we are only using files in two places (or are we?):
- we store images in the file system
- we are using sqlite
Simple. And easy to fix. We wrote it in such a fine way that the database connection has a single handler, and the images also have a single handler that reads the base64 and dumps it into the file system (respecting the image encoding). But yeah, the problem is pretty well scoped now.
The database part was actually clean and doesn't leak anything. Ha! The image part however doesn't close the file at all! Simple fix: just add `defer image.Close()` and the magic is done!
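Roughly, the fixed handler looked something like this. It is a sketch from memory – the function name and arguments (`saveImage`, the base64 payload) are made up, but the shape, and the previously missing `defer`, are the point:

```go
// Sketch of the image handler: decode a base64 payload and dump it to disk.
// The bug was that the created file was never closed; defer fixes that.
package main

import (
	"encoding/base64"
	"log"
	"os"
)

func saveImage(path, payload string) error {
	data, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	// This line was the whole fix: without it, every saved image leaked one fd.
	defer f.Close()

	_, err = f.Write(data)
	return err
}

func main() {
	// Hypothetical usage with a dummy payload.
	payload := base64.StdEncoding.EncodeToString([]byte("not really an image"))
	if err := saveImage("/tmp/out.bin", payload); err != nil {
		log.Fatal(err)
	}
}
```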
And so i did. I deployed the change, restarted the go server, and called it a day. Checked the admin page again and it worked. Wow! That actually only took me a matter of minutes. But something kept telling me there was something wrong:
- after i deployed the changes i restarted the server, which closes any existing connections and / or file buffers. That is not so cool, is it?
- i didn't dig deeper into what these files actually were. I mean, my intuition (not an informed opinion) was that they were still in memory and not yet flushed to storage, and by restarting we lost all of that.
- That simple fix:
- is not tested
- doesn’t explain the root problem at all
- and it didn’t touch on the problem, like what these files are and why that happened!
- And that was the denial part.
But nonetheless i deployed the fix, everything went back to working again, i had a lovely chat with the crush, and that was that.
So, just to challenge my other self, i went back to the server several hours later (the electricity was conveniently off) and i was like: ok dude, let's try and see if you are gonna break again.
It did break again. Fuck it! This time i knew i had to go the hard way and dig deeper (DEEPER SON!).
Let us try the VERY simple solution first: increase the maximum number of open files. We can achieve that simply with:
$ ulimit -n 9000
Notice that on linux systems, the default max is 1024. We increased that, and nope, we were still blowing past it. Well, it turns out there are two kinds of limits:
- hard limit (at the kernel space)
- soft limit
An unprivileged user can only decrease the hard limit, they cannot increase it, whilst they are free to raise or lower the soft limit (up to the hard limit). We are interested in the hard limit though. Raising the hard limit properly requires logging out and starting a new session (i was a bit lazy to do that), and the other thing: it doesn't help with the problem at all! It just delays it for another time.
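By the way, you can also peek at both limits from inside the Go process itself. A small sketch using the standard `syscall` package (nothing project-specific here):

```go
// Print the soft and hard limits on open file descriptors (RLIMIT_NOFILE)
// for the current process – the same numbers `ulimit -n` / `ulimit -Hn` show.
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("soft limit: %d, hard limit: %d\n", rl.Cur, rl.Max)
}
```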
I had three thoughts:
- http connections open sockets and a socket is a file!
- we closed the image leakage
- but we still hit max opened files.
We only have one connection to the database (one fd), and now we always release the image files. Things should work. Apparently things didn't work, and we had to dig deeper. Let us investigate those open files! Where do they point and what the hell are they!
$ ls -l /proc/620272/fd
This command basically lists every file descriptor under our process id (620272) – can you tell where we got our process id from? It listed a whole lot of files, around 10,000 of them. Going deeper, we want even more details about all of those 10K files: what they link to, what their paths are, etc. And `lsof` for the win, yet again.
$ lsof -i -a -p 620272
What it does is simply list all of the open network files (-i) for the given process id (-p); the -a just ANDs the two conditions together. That is the whole thing. The result was astonishing, to say the least.
ubuntu@client ~> lsof -i -a -p 620272
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
xxxx 620272 ubuntu 4u IPv6 13530919 0t0 TCP *:6662 (LISTEN)
xxxx 620272 ubuntu 1010u IPv4 13716482 0t0 TCP 10.64.2.135:54140->xxx.xxx.xxx.xxx:http (CLOSE_WAIT)
xxxx 620272 ubuntu 1011u IPv4 13694972 0t0 TCP 10.64.2.135:54160->xxx.xxx.xxx.xxx:http (ESTABLISHED)
We have got only three open files! Fuck me! Wuuuut! Ok ok i'm being hilarious here, this is just a sample of the output. We are leaking something. And it is divided between two classes:
- CLOSE_WAIT
- ESTABLISHED
We are leaking file descriptors through the outgoing connections we make to an external service, while also getting stuck establishing new ones (to the same external service). So ok, why, or even how, did that happen? And what does CLOSE_WAIT mean anyway? The happy path roughly works like this:
- we send a request to the server
- we are pending until the server replies to us
- the server receives the request and responds back
- we then close the connection, or in certain cases we employ http keep-alive to better reuse an existing socket between the two endpoints, which helps with performance
I'm gonna spare you the gory TCP details, but basically CLOSE_WAIT means the external service has already closed its side of the connection and is waiting for us to say that we are finished too, so the socket can be fully closed on both ends. But that never happened on our side!
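The Go-side pattern that produces exactly this picture looks roughly like the code below. It is a simplified sketch (the URL and the `ping` function are made up, not our actual code): the default client has no timeout, and the response body is never read or closed, so every call can park one socket in CLOSE_WAIT (or leave it hanging in ESTABLISHED) forever.

```go
// Anti-pattern sketch: how to leak one fd per call to an external service.
package main

import (
	"fmt"
	"net/http"
)

// ping is a made-up example of calling the external service.
func ping() (int, error) {
	// http.Get uses http.DefaultClient, which has no timeout at all:
	// if the remote end hangs, this connection can stay ESTABLISHED forever.
	resp, err := http.Get("https://external.example.com/api/health")
	if err != nil {
		return 0, err
	}
	// BUG: resp.Body is never read nor closed, so the underlying TCP
	// connection (and its file descriptor) is never released. Once the
	// remote side gives up and closes, our socket sits in CLOSE_WAIT.
	return resp.StatusCode, nil
}

func main() {
	fmt.Println(ping())
}
```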
Now, time to check go's standard library and see how the http client handles closing connections. Go's default http client surprisingly doesn't set any timeout (so weird), and the user is responsible for handling that. The fix is quite simple, and it has two parts (there is a sketch of both right after this list):
- first, set a timeout on the http client so that when a connection goes bad (say their server is down or hangs), we don't keep the socket open forever!
- second, after receiving a response from the server, we manually close the response body from our side. It feels like just an extra step, but in go it is actually required: the client won't release (or reuse) the connection otherwise.
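Putting the two together, the patched call looks roughly like this (again a sketch with made-up names – the important bits are the `Timeout` field and the `defer resp.Body.Close()`):

```go
// Sketch of the fixed client: a request timeout plus always closing the body.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client with an overall timeout per request.
var client = &http.Client{Timeout: 10 * time.Second}

// ping is the same made-up call as before, minus the leak.
func ping() (int, error) {
	resp, err := client.Get("https://external.example.com/api/health")
	if err != nil {
		return 0, err
	}
	// Always close the body; draining it first lets the connection be
	// reused via keep-alive instead of being torn down.
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return resp.StatusCode, nil
}

func main() {
	fmt.Println(ping())
}
```

The `Timeout` on `http.Client` covers the whole exchange – dialing, waiting for the response, and reading the body – so a dead external service can no longer hold one of our sockets open indefinitely.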
Deployed the changes, restarted the server, and voilà, everything works! We are back to a sane number of file descriptors.
What did we learn
A lot. But i will write that some time later.
- Things go wrong. Always. Design for when things go wrong.
- The system we have been working on is quite simple, but one thing that stood out to me is how much it helped that the code was a bit tidy. I have not touched this code in months, and it sure paid off to have things arranged in a good way.
- We didn't have a monitoring system in place and that is really scary. There should be an automated way to report all of the errors to the team.
- Testing is really important. This system was never exposed to any kind of stress testing. It would have taken me a couple of hours, and it would have saved us the pain of this whole debugging session.
- It is important to test on the target platform as well. Stuff like `ulimit` can only be reproduced on a real system (of course that is not the case with cloud providers / PaaS). But nonetheless it is important to get that experience.
- Also, the writing is inspired by @3omer