by Paul Smith (@paulsmith)
How does the Redis server work?
I was curious to learn more about Redis’s internals, so I’ve
been familiarizing myself with the source, largely by reading and jumping
around in Emacs. After I had peeled back enough of the onion’s layers,
I realized I was trying to keep track of too many details in my head, and it
wasn’t clear how it all hung together. I decided to write out in
narrative form how an instance of the Redis server starts up and initializes
itself, and how it handles the request/response cycle with a client, as a way
of explaining it to myself, hopefully in a clear fashion. Luckily, Redis has
a nice, clean code base that is easy to read and follow along. Armed with a
TAGS file, my
$EDITOR
, and GDB, I set out to see how it all works under the
hood. (Incidentally, I was working with the Redis code base as of
commit b4f2e41. Of course, internals such as I outline below are subject
to change. However, the broad architecture of the server is unlikely to
change very much, and I tried to keep that in mind as I went along.)
This article examines server startup and takes a high-level view of the
request/response processing cycle. In a subsequent article,
I’ll dive in to greater detail and trace a simple
SET
/GET
command pair as they make their way
through Redis.
Let’s begin with the main()
function in
redis.c
.
First, initServerConfig()
is called. This partially
initializes a variable server
, which has the type struct
redisServer
, that serves as the global server state.
// redis.h:338 struct redisServer { pthread_t mainthread; int port; int fd; redisDb *db; // ... }; // redis.c:69 struct redisServer server; /* server global state */
There are a huge number of members in this struct, but they generally fall into the following categories:
For example, this struct includes members that map to options in the
configuration file (usually named redis.conf
) such as the port
the server listens on and how verbose logging should be, pointers to linked
lists of connected clients and slave Redis servers as well as the Redis
database(s) itself, and counters for statistics like the number of commands
processed since startup.
initServerConfig()
provides default values for the members
that can be configured by the user via the redis.conf
config
file.
The next thing main()
does is sort the table of Redis
commands. These are defined in a global variable
readonlyCommandTable
which is an array of struct
redisCommand
s.
// redis.c:70 struct redisCommand *commandTable; struct redisCommand readonlyCommandTable[] = { {"get",getCommand,2,REDIS_CMD_INLINE,NULL,1,1,1}, {"set",setCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0}, {"setnx",setnxCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0}, {"setex",setexCommand,4,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,0,0,0}, {"append",appendCommand,3,REDIS_CMD_BULK|REDIS_CMD_DENYOOM,NULL,1,1,1}, // ... }; // redis.h:458 typedef void redisCommandProc(redisClient *c); // ... struct redisCommand { char *name; redisCommandProc *proc; int arity; int flags; // ... };
The read-only table is ordered in source code so that commands are grouped
by category, such as string commands, list commands, set commands, etc., to
make it easier for a programmer to scan the table for similar commands. The
sorted table of commands is pointed to by the global variable
commandTable
, and is used to lookup Redis commands with a
standard binary search (lookupCommand()
, which returns a pointer
to a redisCommand
).
main()
moves on to processing command-line options given by
the user starting up the redis-server executable. Currently, there is only a
single argument that Redis takes — aside from the usual version
-v
and help -h
flags — which is a path to a
config file. If the path is given, Redis loads the config file and overrides
any defaults already set by initServerConfig()
by calling
loadServerConfig()
. This function is fairly straightforward,
looping over each line in the config file and converting values that match
directive names to appropriate types for the matching member in the
server
struct. At this point, Redis will daemonize and detach
from the controlling terminal if it has been configured to do so.
initServer()
initServer()
finishes the job of initializing the
server
struct that was begun by initServerConfig()
.
First, it sets up signal handling (SIGHUP
and
SIGPIPE
signals are ignored — there is an opportunity to
improve Redis by adding the ability to reload its config file when it
receives a SIGHUP
, in the fashion of other daemons), including
printing a stacktrace if the server receives a SIGSEGV
(and
other related signals), see segvHandler()
.
A number of doubly-linked lists (see adlist.h
) are created to
keep track of clients, slaves, monitors (a client that has sent the
MONITOR
command), and an object free list.
One interesting thing Redis does is create a number of shared objects,
which are accessible via the global shared
struct. Common Redis
objects that are required by many different commands, response strings and
error messages, for example, can be shared without having to allocate them
each time, saving memory, with the tradeoff of a bit more initialization
effort at startup time.
// redis.c:662 shared.crlf = createObject(REDIS_STRING,sdsnew("\r\n")); shared.ok = createObject(REDIS_STRING,sdsnew("+OK\r\n")); shared.err = createObject(REDIS_STRING,sdsnew("-ERR\r\n")); shared.emptybulk = createObject(REDIS_STRING,sdsnew("$0\r\n\r\n")); // ...
The greatest impact in terms of memory savings with shared objects comes from a large pool of shared integers.
// redis.c:705 for (j = 0; j < REDIS_SHARED_INTEGERS; j++) { shared.integers[j] = createObject(REDIS_STRING,(void*)(long)j); shared.integers[j]->encoding = REDIS_ENCODING_INT; }
createSharedObjects()
creates an array of the first 10,000
non-negative integers as Redis objects (strings with an integer encoding).
Various Redis objects like strings, sets, and lists often contain many small
integers (for IDs or counters), and they can reuse the same objects already
allocated in memory, for a large potential savings. One could imagine the
constant that defines the number of shared integers to be created,
REDIS_SHARED_INTEGERS
, being exposed to configuration to give
users the opportunity, based on knowledge of their applications and needs, to
increase the number of shared integers for greater memory savings. The
tradeoff is that Redis statically allocates slightly more memory at startup
time, but this would likely be a small amount compared to the overall typical
database sizes and potential savings.
initServer()
continues by creating the core event loop,
calling aeCreateEventLoop()
(see ae.c
) and
assigning the result to the el
member of server
.
ae.h
provides a platform-independent wrapper for
setting up I/O event notification loops, which uses epoll
on
Linux, kqueue
on BSD, and falls back to select
if
the respective first choice is not available.) Redis’s event loop polls
for new connections and I/O events (reading requests from and writing
responses to a socket), being triggered when a new event arrives. This is
what makes Redis so responsive, it can serve thousands of clients
simultaneously without blocking while individual requests are processed and
responded to.
initServer()
also initializes a number of
redisDb
objects, which are structs that encapsulate the details
of a particular Redis database, including tracking expiring keys, keys that
are blocking (either from a B{L,R}POP
command or from I/O), and
keys that are being watched for check-and-set. (By default there are 16
separate databases, which can be thought of as namespaces within a Redis
server.)
initServer()
is where the socket that Redis listens for
connections (by default, bound to port 6379) is set up. Another Redis-local
wrapper, anet.h
, defines anetTcpServer()
and a
number of other functions that simplify the usual complexity of setting up a
new socket, binding, and listening to a port.
// redis.c:791 server.fd = anetTcpServer(server.neterr, server.port, server.bindaddr); if (server.fd == -1) { redisLog(REDIS_WARNING, "Opening TCP port: %s", server.neterr); exit(1); }
initServer()
further allocates various dicts and lists for
databases and for pub/sub, resets stats and various flags, and notes the UNIX
timestamp of the server start time. It registers serverCron()
with the event loop as a time event, executing that function once every 100
milliseconds. (This is a bit tricky, because initially,
serverCron()
is set to run in 1 millisecond by
initServer()
, in order to have the cron cycle start right away
with the server startup, but then the return value of
serverCron()
, which is 100, is plugged in to the calculation of
the next time the time event process should be handled.)
serverCron()
performs a number of periodic tasks for Redis,
including verbose logging of database size (# of keys and memory used) and
connected clients, resizing hash tables, closing idle/timed-out client
connections, performing any post-background save or AOF rewrite cleanup,
kicking off a background save if the save conditions as configured have been
met (so many keys changed in so many seconds), calculating LRU information
and dealing with expired keys (Redis only expires a few timed-out keys per
cron cycle, using a adaptive, statistical method to avoid tying up the
server, but will get more aggressive if expiring keys can help avoid
out-of-memory situations), swapping out values to disk if virtual memory is
enabled, and syncing with a master if this server is a slave.
Crucially, initServer()
hooks up the event loop with the
server’s TCP socket by registering the socket’s descriptor,
registering the acceptHandler()
function to be called when a new
connection is accepted. (More on this below in the “Processing a
request” section.)
// redis.c:821 if (aeCreateFileEvent(server.el, server.fd, AE_READABLE, acceptHandler, NULL) == AE_ERR) oom("creating file event");
initServer()
creates or opens the append-only file (AOF), it
the server was configured to use it.
// redis.c:824 if (server.appendonly) { server.appendfd = open(server.appendfilename,O_WRONLY|O_APPEND|O_CREAT,0644);
Finally, initServer()
initializes Redis’s virtual
memory system, again, if the server had been configured for it.
main()
If the server was configured to daemonize, Redis will now try to write out
a pid file (the path of which is configurable, but defaults to
/var/run/redis.pid
).
At this point, the server has started up, and Redis will log this fact to
its log file. However, there’s still a bit more to do in
main()
before Redis is fully ready.
If there is an AOF or a database dump file (eg., dump.rdb
),
it will be loaded, restoring data to the server from a previous session. (If
both exist, the AOF takes priority.)
// redis.c:1452 if (server.appendonly) { if (loadAppendOnlyFile(server.appendfilename) == REDIS_OK) redisLog(REDIS_NOTICE,"DB loaded from append only file: %ld seconds",time(NULL)-start); } else { if (rdbLoad(server.dbfilename) == REDIS_OK) redisLog(REDIS_NOTICE,"DB loaded from disk: %ld seconds",time(NULL)-start); }
The server is now ready to start accepting requests.
To finish up, Redis registers a function to be called each time it enters
the event loop, beforeSleep()
(since the process essentially
goes to sleep while it waits to be notified of events).
beforeSleep()
does two things: it deals with serving clients
that have requested keys that were swapped to disk if the virtual memory
system was enabled, and it flushes the AOF to disk. The writing of the AOF is
handled by flushAppendOnlyFile()
. The function encapsulates some
tricky logic about flushing the buffer that holds pending AOF writes (the
frequency of which is configurable by the user).
Redis now enters the main event loop by calling aeMain()
,
with the argument server.el
(remember, this member contains a
pointer to an aeEventLoop
). If there are any time (i.e., server
cron) or file events to process each time through the loop, their respective
handler functions will be called. aeProcessEvents()
encapsulates
this logic — time events are handled by custom logic, whereas file
events are handled by the underlying epoll
or
kqueue
or select
I/O event notification system.
Because of Redis’s need to respond to time events as well as file or
I/O events, it implements a custom event/polling loop, aeMain()
.
By checking to see if any time events need processing, and utilizing file
event notification, the event loop can efficiently sleep until there is work
to be done, and not tie up the CPU in a tight while
loop.
We are now inside Redis’s main event polling loop, listening on a port and waiting for clients to connect. It’s time to look at how Redis processes a command request.
Back under initServer()
, Redis registered
acceptHandler()
to be called when there was an I/O event
associated with the file descriptor of the socket the server is listening to
(i.e., the socket has data waiting to be read or written).
acceptHandler()
creates a client object — pointer to a
redisClient
, a struct defined in redis.h
— to
represent a new client connection.
// networking.c:347 cfd = anetAccept(server.neterr, fd, cip, &cport); if (cfd == AE_ERR) { redisLog(REDIS_VERBOSE,"Accepting client connection: %s", server.neterr); return; } redisLog(REDIS_VERBOSE,"Accepted %s:%d", cip, cport); if ((c = createClient(cfd)) == NULL) { redisLog(REDIS_WARNING,"Error allocating resoures for the client"); close(cfd); /* May be already closed, just ingore errors */ return; }
createClient()
is called to allocate and initialize the
client object. It selects database 0 by default (since there has to be at
least one Redis db per server), and associates the client file descriptor
generated by accept(2)
in
acceptHandler()
with the client object. Other flags and members
are initialized, and finally the client is appended to the global list of
clients being tracked by server.clients
. The key thing Redis
does in createClient()
is registering a handler with the event
loop, the function readQueryFromClient()
, for when there is data
from the client connection to be read.
// networking.c:20 if (aeCreateFileEvent(server.el,fd,AE_READABLE, readQueryFromClient, c) == AE_ERR) { close(fd); zfree(c); return NULL; }
readQueryFromClient()
is called by the main event loop when
the client makes a command request. (If you are debugging with GDB, this is a
good function to set as a breakpoint.) It reads it as much as it can of the
command — up to 1024 bytes — to a temporary buffer, then appends
it to a client-specific query buffer. This allows Redis to process commands
where the payload (command name plus arguments) is larger than 1024 bytes, or
because of I/O reasons have been split up into multiple read events. It then
calls processInputBuffer()
, passing the client object as an
argument.
// networking.c:754 void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) { redisClient *c = (redisClient*) privdata; char buf[REDIS_IOBUF_LEN]; int nread; // ... nread = read(fd, buf, REDIS_IOBUF_LEN); // ... if (nread) { size_t oldlen = sdslen(c->querybuf); c->querybuf = sdscatlen(c->querybuf, buf, nread); c->lastinteraction = time(NULL); /* Scan this new piece of the query for the newline. We do this * here in order to make sure we perform this scan just one time * per piece of buffer, leading to an O(N) scan instead of O(N*N) */ if (c->bulklen == -1 && c->newline == NULL) c->newline = strchr(c->querybuf+oldlen,'\n'); } else { return; } Processinputbuffer(c); }
processInputBuffer()
parses the raw query from the client
into arguments for execution of a Redis command. It first has to contend with
the possibility that the client is blocked in a B{L,R}POP
command, and bails early if that is the case. The function then parses the
raw query buffer into arguments, creating Redis string objects of each and
storing them in an array on the client object. The query is in the form of
the Redis
protocol. processInputBuffer()
is really a protocol parser,
calling back to processCommand()
to fully parse the query.
Somewhat confusingly, the source code comments describe parsing the
“multi bulk command type,” an alternative protocol originally
meant for commands like MSET
, but it actually is now the main
Redis protocol for all commands. It is binary-safe and easy to parse and
debug. (Note: this code has been refactored for the upcoming 2.2 release and
is a bit easier to follow.) Now it’s time to actually execute the
command the client has sent, by calling processCommand()
on the
client object.
processCommand()
takes the arguments of a command from a
client and executes it. Before it gets to the actual execution of the
command, it performs a number of checks — if any check fails, it
appends a error message to the client object’s reply list and returns
to the caller, processInputBuffer()
. After handling the
QUIT
command as a special case (in order to shut down the client
safely), processCommand()
looks up the command name in the
commandTable
that was previously set up during Redis’s
start up cycle. If it’s an unknown command, or the client got the arity
of the command wrong, it’s an error. While it’s not commonly
used, Redis can be configured to require a password to authenticate a client
before it will accept commands, and this is the stage where Redis will check
if the client is authenticate, and will set an error if not. If Redis is
configured to use up to a maximum amount of memory, it will at this point try
to free up memory if it can (be freeing objects from the free list and
removing expired keys), otherwise if the server is over the limit, it
won’t process commands with the REDIS_CMD_DENYOOM
flag set
(mainly writes like SET
, INCR
, RPUSH
,
ZADD
, etc.), again, an error. One final check Redis makes is
that a client can only issue SUBSCRIBE
or
UNSUBSCRIBE
commands while there are outstanding channels
subscribed to, otherwise, it’s an error. If all the checks have been
passed, the command will be executed, by calling call()
with the
client object and the command object as arguments.
call()
, gets a pointer to function of type struct
redisCommandProc
, from the proc
member of the
command
object, which takes a single argument, that of a client
object. The Redis command procedure is called.
// redis.c:864 void call(redisClient *c, struct redisCommand *cmd) { long long dirty; dirty = server.dirty; cmd->proc(c); dirty = server.dirty-dirty; } // ...
Write commands, like SET
and ZADD
, make the
server “dirty,” in other words, the server is marked as having
pages in memory that have changed. This is important for the automatic save
process, which keeps track of how many keys have changed in a certain
period, or the writing to the AOF. The function calls
feedAppendOnlyFile()
if use of the AOF has been enabled, which
writes out the command buffer from the client to the AOF, so the command can
be replayed. (It translates commands that set a relative key expiration to
an absolute expiration, but otherwise it basically copies the command as it
came in from the client, see catAppendOnlyGenericCommand()
.) If
any slaves are connected, call()
will send the command to each
of them to be executed locally, see replicationFeedSlaves()
.
Likewise, if any clients are connected and have issued the
MONITOR
command, Redis will send a representation of the
command, prefixed with a timestamp, see
replicationFeedMonitors()
.
// redis.c:871 (call() cont.'d) // ... if (server.appendonly && dirty) feedAppendOnlyFile(cmd,c->db->id,c->argv,c->argc); if ((dirty || cmd->flags & REDIS_CMD_FORCE_REPLICATION) && listLength(server.slaves)) replicationFeedSlaves(server.slaves,c->db->id,c->argv,c->argc); if (listLength(server.monitors)) replicationFeedMonitors(server.monitors,c->db->id,c->argv,c->argc); server.stat_numcommands++; }
Control returns to the caller, processCommand()
, which resets
the client object for subsequent commands.
As mentioned, each Redis command procedure is itself responsible for
setting the response to be sent to the client. After
readQueryFromClient()
exits and Redis returns the to event loop
in aeMain()
, aeProcessEvents()
will pick up the
waiting response in the write buffer and will copy it to the socket the
client is connected on.
And that’s it! The response has been sent, and both client and server are back to a state where they can respectively emit and process more Redis commands.
Redis starts up by initializing a global server state variable, and
reading in an optional configuration file to override any defaults. It sets
up a global command table that connects command names with the actual
function that implements the command. It creates an event loop using the best
available underlying system library for event/readiness notification, and
registers a handler function for when there is a new client socket connection
to accept. It also registers a periodic (i.e., time-based) event handler for
dealing with cron
-like tasks like key expiry that need to be
addressed outside the regular client-handling path. Once a client has
connected, a function is registered on the event loop for being notified when
the client has data to be read (i.e., a query for a command). The
client’s query is parsed and a command handler is called to execute the
command and write the response back to the client (the writing of data to the
client is also handled by the event-notification loop). The client object is
reset and server is ready to process more queries.
SET
and GET
I’ll follow-up this article with one that takes a close look at the
processing of two commands, SET
and GET
, by
stepping through the implementation of each command’s procedure and
examining the data structures Redis uses to store and index data.
March 15, 2011: here it is.
—Paul Smith (follow me on the Twitter)
Thanks to pietern on #redis
for feedback on a draft of
this article