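There is plenty of discussion of epoll online, but most of it is neither detailed nor accurate, and I have been asked about it in interviews. Without digging deeper, one only knows that it works, not why it works. So I translated the epoll manual page to understand epoll event handling more thoroughly; the material also touches on operating system kernel internals.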
EPOLL(7) Linux Programmer's Manual
NAME
epoll - I/O event notification facility
Translation: 6700662@qq.com. Please credit the source when reposting.
DESCRIPTION
The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors. The following system calls are provided to create and manage an epoll instance:
* epoll_create(2) creates an epoll instance and returns a file descriptor referring to that instance. (The more recent epoll_create1(2) extends the functionality of epoll_create(2).)
* Interest in particular file descriptors is then registered via epoll_ctl(2). The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set.
* epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available.
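The three calls listed above can be sketched end to end on a pipe. This is a minimal illustration, not code from the manual: the helper name epoll_basic_demo() and the 1000 ms timeout are assumptions made for the example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs create -> ctl -> wait once on a pipe.
   Returns the number of ready descriptors reported, or -1 on error. */
int epoll_basic_demo(void)
{
    int pfd[2];
    int ready = -1;
    struct epoll_event ev = {0}, out[1];

    if (pipe(pfd) == -1)
        return -1;

    int epfd = epoll_create1(0);                 /* 1. create the epoll instance */
    if (epfd != -1) {
        ev.events = EPOLLIN;
        ev.data.fd = pfd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&  /* 2. register interest */
            write(pfd[1], "x", 1) == 1)
            ready = epoll_wait(epfd, out, 1, 1000);              /* 3. wait up to 1 s */
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return ready;
}
```

Because one byte is written before the wait, the call should report a single ready descriptor rather than block for the full timeout.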
Level-triggered and edge-triggered
The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT). The difference between the two mechanisms can be described as follows. Suppose that this scenario happens:
1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the data it already sent. The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor. So, in step 5 the caller might end up waiting for some data that is already present inside the input buffer (translator's note: no new event is triggered, so it keeps waiting). In the above example, an event on rfd will be generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely.
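The five-step scenario can be replayed on a real pipe. The sketch below is illustrative only: the helper name second_wait_result() is an assumption, and step 5 uses a zero timeout so that the edge-triggered case returns 0 instead of hanging.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper replaying steps 1-5 on a pipe.
   use_et != 0 registers rfd with EPOLLET. Returns what the second
   epoll_wait() (step 5) reports with a zero timeout, or -1 on error. */
int second_wait_result(int use_et)
{
    int pfd[2], result = -1;
    char buf[2048];
    struct epoll_event ev = {0}, out[1];

    if (pipe(pfd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN;
    if (use_et)
        ev.events |= EPOLLET;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)   /* step 1 */
        goto out_ep;

    memset(buf, 'a', sizeof buf);
    if (write(pfd[1], buf, 2048) != 2048)                    /* step 2 */
        goto out_ep;
    if (epoll_wait(epfd, out, 1, 0) != 1)                    /* step 3 */
        goto out_ep;
    if (read(pfd[0], buf, 1024) != 1024)                     /* step 4 */
        goto out_ep;
    result = epoll_wait(epfd, out, 1, 0);                    /* step 5 */
out_ep:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```

With EPOLLET the second wait should report nothing although 1 kB is still buffered; without it (level-triggered) the descriptor is reported again.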
An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors. The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:
i with nonblocking file descriptors; and
ii by waiting for an event only after read(2) or write(2) return EAGAIN.
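The two rules above can be sketched for the read side as follows. The helper names set_nonblocking() and drain_fd() and the 512-byte buffer are assumptions made for the illustration; a real program would process each chunk instead of only counting bytes.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: put fd into nonblocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: read from a nonblocking fd until read() reports
   EAGAIN (nothing left) or EOF. Returns the number of bytes consumed,
   or -1 on any other error. Only after this returns should the caller
   go back to epoll_wait() for this descriptor. */
long drain_fd(int fd)
{
    char buf[512];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            total += n;        /* a real program would process buf here */
        else if (n == 0)
            return total;      /* EOF */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;      /* drained: safe to wait for the next edge */
        else if (errno != EINTR)
            return -1;         /* genuine error */
    }
}
```

The write side follows the same pattern: keep calling write(2) until it returns EAGAIN, then wait for EPOLLOUT.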
By contrast, when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used wherever the latter is used since it shares the same semantics.
Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2). When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.
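The disable-then-rearm cycle of EPOLLONESHOT can be observed on a pipe. This is a sketch under assumptions: the helper name oneshot_demo() and the three-element result array are invented for the example, and all waits use a zero timeout so nothing blocks.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. Fills r[0..2] with the results of three
   zero-timeout epoll_wait() calls: right after the event arrives,
   before rearming, and after rearming with EPOLL_CTL_MOD.
   Returns 0 on success, -1 on setup failure. */
int oneshot_demo(int r[3])
{
    int pfd[2], rc = -1;
    struct epoll_event ev = {0}, out[1];

    if (pipe(pfd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1 ||
        write(pfd[1], "x", 1) != 1)
        goto out_ep;

    r[0] = epoll_wait(epfd, out, 1, 0);  /* event delivered, fd now disabled */
    r[1] = epoll_wait(epfd, out, 1, 0);  /* byte still unread, but no event */

    ev.events = EPOLLIN | EPOLLONESHOT;  /* rearm the descriptor */
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == -1)
        goto out_ep;
    r[2] = epoll_wait(epfd, out, 1, 0);  /* reported again after rearming */
    rc = 0;
out_ep:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

The middle wait shows the descriptor staying silent while disabled, even though unread data is still in the pipe.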
Interaction with autosleep
If the system is in autosleep mode via /sys/power/autosleep and an event happens which wakes the device from sleep, the device driver will keep the device awake only until that event is queued. To keep the device awake until the event has been processed, it is necessary to use the epoll(7) EPOLLWAKEUP flag.
When the EPOLLWAKEUP flag is set in the events field for a struct epoll_event, the system will be kept awake from the moment the event is queued, through the epoll_wait(2) call which returns the event until the subsequent epoll_wait(2) call. If the event should keep the system awake beyond that time, then a separate wake_lock should be taken before the second epoll_wait(2) call.
/proc interfaces
The following interfaces can be used to limit the amount of kernel memory consumed by epoll:
/proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
This specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system. The limit is per real user ID. Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel. Currently, the default value for max_user_watches is 1/25 (4%) of the available low memory, divided by the registration cost in bytes.
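As a rough worked example of that default: with 1 GiB of low memory and a cost of 160 bytes per registration, (1073741824 / 25) / 160 gives about 268435 watches. The sketch below only mirrors the formula as the manual describes it; the function names are assumptions, and the kernel's own computation may differ in rounding and in what it counts as low memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: estimate the default limit from the description
   "1/25 of low memory, divided by the per-registration cost". */
uint64_t estimate_max_user_watches(uint64_t low_mem_bytes, uint64_t cost_bytes)
{
    if (cost_bytes == 0)
        return 0;
    return (low_mem_bytes / 25) / cost_bytes;
}

/* Hypothetical helper: read the limit currently in force.
   Returns -1 if the /proc file is missing or unreadable. */
long long read_max_user_watches(void)
{
    long long value = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Comparing the two values on a given machine shows how close the estimate is to the limit the kernel actually chose.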
Example for suggested usage
While the usage of epoll when employed as a level-triggered interface does have the same semantics as poll(2), the edge-triggered usage requires more clarification to avoid stalls in the application event loop. In this example, listener is a nonblocking socket on which listen(2) has been called. The function do_use_fd() uses the new ready file descriptor until EAGAIN is returned by either read(2) or write(2). An event-driven state machine application should, after having received EAGAIN, record its current state so that at the next call to do_use_fd() it will continue to read(2) or write(2) from where it stopped before.
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd;
int n;                          /* loop index, not declared in the original */
struct sockaddr_storage local;  /* filled in by accept(); this type is an assumption */
socklen_t addrlen;

/* Code to set up listening socket, 'listen_sock',
   (socket(), bind(), listen()) omitted */

epollfd = epoll_create1(0);
if (epollfd == -1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            addrlen = sizeof(local);   /* accept() needs the buffer size on input */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &local, &addrlen);
            if (conn_sock == -1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            /* New connections are watched in edge-triggered mode */
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == -1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
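The example calls setnonblocking() and do_use_fd() without defining them. The sketch below shows one plausible setnonblocking() and the write half of a do_use_fd()-style state machine that remembers where it stopped when write(2) returns EAGAIN. The struct conn_state layout and the name flush_pending() are assumptions made for the illustration, not part of the manual.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* One plausible definition of the helper the example calls. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical per-connection record: what is left to send and how far
   the previous call got. */
struct conn_state {
    int fd;
    const char *out;   /* data queued for this connection */
    size_t len;        /* total bytes queued */
    size_t off;        /* bytes already written */
};

/* Write until everything is sent or write() reports EAGAIN.
   Returns 1 when finished, 0 when the caller must wait for the next
   EPOLLOUT event (progress is kept in c->off), -1 on error. */
int flush_pending(struct conn_state *c)
{
    while (c->off < c->len) {
        ssize_t n = write(c->fd, c->out + c->off, c->len - c->off);
        if (n > 0)
            c->off += (size_t) n;          /* record the current state */
        else if (n == -1 && errno == EINTR)
            continue;
        else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;                      /* resume here on the next event */
        else
            return -1;
    }
    return 1;
}
```

A do_use_fd() built this way would look up the conn_state for the ready descriptor and call flush_pending() again, continuing from c->off rather than starting over.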
When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD. (Translator's note: switching between EPOLLIN and EPOLLOUT with separate epoll_ctl(2) calls costs more time.)
Questions and answers
Q0 What is the key used to distinguish the file descriptors registered in an epoll set?
A0 The key is the combination of the file descriptor number and the open file description (also known as an "open file handle", the kernel's internal representation of an open file).
Q1 What happens if you register the same file descriptor on an epoll instance twice?
A1 You will probably get EEXIST. However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) descriptor to the same epoll instance. This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks.
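Both halves of A1 can be checked with a short experiment. The helper name duplicate_registration_demo() and its output parameters are assumptions for this sketch; the duplicate is registered with EPOLLHUP only, as an example of a different event mask.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. *same_fd_errno receives the errno of adding the
   same descriptor twice (0 if it unexpectedly succeeded); *dup_fd_rc
   receives the return value of adding a dup(2) of it.
   Returns 0 on success, -1 on setup failure. */
int duplicate_registration_demo(int *same_fd_errno, int *dup_fd_rc)
{
    int pfd[2], rc = -1;
    struct epoll_event ev = {0};

    if (pipe(pfd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto out_ep;

    /* Second registration of the very same descriptor number */
    errno = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        *same_fd_errno = errno;
    else
        *same_fd_errno = 0;

    /* A duplicate descriptor is a different key, so it can be added */
    int dupfd = dup(pfd[0]);
    if (dupfd == -1)
        goto out_ep;
    ev.events = EPOLLHUP;          /* different mask: hang-up only */
    ev.data.fd = dupfd;
    *dup_fd_rc = epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev);
    close(dupfd);
    rc = 0;
out_ep:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

This works because the key described in A0 includes the descriptor number: the duplicate shares the open file description but has a different number.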
Q2 Can two epoll instances wait for the same file descriptor? If so, are events reported to both epoll file descriptors?
A2 Yes, and events would be reported to both. However, careful programming may be needed to do this correctly.
Q3 Is the epoll file descriptor itself poll/epoll/selectable?
A3 Yes. If an epoll file descriptor has events waiting, then it will indicate as being readable.
Q4 What happens if one attempts to put an epoll file descriptor into its own file descriptor set?
A4 The epoll_ctl(2) call will fail (EINVAL). However, you can add an epoll file descriptor inside another epoll file descriptor set.
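A3 and A4 can be tried together: adding an epoll descriptor to itself should fail, while nesting it inside a second instance makes the outer instance see it as readable once the inner one has a pending event. The helper name nested_epoll_demo() and its parameters are assumptions for this sketch.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. *self_add_errno receives the errno of adding an
   epoll fd to itself (0 if it unexpectedly succeeded); *outer_ready
   receives what the outer instance reports once the inner one has a
   pending event. Returns 0 on success, -1 on setup failure. */
int nested_epoll_demo(int *self_add_errno, int *outer_ready)
{
    int pfd[2], rc = -1;
    struct epoll_event ev = {0}, out[1];

    if (pipe(pfd) == -1)
        return -1;
    int inner = epoll_create1(0);
    int outer = epoll_create1(0);
    if (inner == -1 || outer == -1)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = inner;
    errno = 0;
    if (epoll_ctl(inner, EPOLL_CTL_ADD, inner, &ev) == -1)   /* Q4: own set */
        *self_add_errno = errno;
    else
        *self_add_errno = 0;

    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1)   /* nesting is allowed */
        goto done;

    ev.data.fd = pfd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pfd[0], &ev) == -1 ||
        write(pfd[1], "x", 1) != 1)
        goto done;

    *outer_ready = epoll_wait(outer, out, 1, 0);             /* Q3: inner is readable */
    rc = 0;
done:
    if (inner != -1)
        close(inner);
    if (outer != -1)
        close(outer);
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```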
Q5 Can I send an epoll file descriptor over a UNIX domain socket to another process?
A5 Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the epoll set.
Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?
A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
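The caveat in A6 is easy to reproduce. In this sketch (the helper name close_with_dup_demo() and its parameters are assumptions), the registered descriptor is closed while a duplicate stays open, and the epoll instance is polled before and after the duplicate is closed too.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. *after_close receives what epoll_wait() reports
   after the registered fd was closed but a dup(2) of it is still open;
   *after_last_close receives the result once the duplicate is closed
   as well. Returns 0 on success, -1 on setup failure. */
int close_with_dup_demo(int *after_close, int *after_last_close)
{
    int pfd[2], rc = -1;
    struct epoll_event ev = {0}, out[1];

    if (pipe(pfd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    int dupfd = dup(pfd[0]);
    if (epfd == -1 || dupfd == -1)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;

    close(pfd[0]);                 /* registered fd closed, description lives on */
    pfd[0] = -1;
    if (write(pfd[1], "x", 1) != 1)
        goto done;
    *after_close = epoll_wait(epfd, out, 1, 0);

    close(dupfd);                  /* last reference: registration disappears */
    dupfd = -1;
    *after_last_close = epoll_wait(epfd, out, 1, 0);
    rc = 0;
done:
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    if (pfd[0] != -1)
        close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

While the duplicate is open, the event carries the number of a descriptor the program has already closed. Calling EPOLL_CTL_DEL before close(2) avoids this.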
Q7 If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately?
A7 They will be combined.
Q8 Does an operation on a file descriptor affect the already collected but not yet reported events?
A8 You can do two operations on an existing file descriptor. Remove would be meaningless for this case. Modify will reread available I/O. (Translator's note: presumably this means the event is generated again?)
Q9 Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior)?
A9 Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. You must consider it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor is entirely up to you.
For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is to continue to read/write until EAGAIN.
For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)
Possible pitfalls and ways to avoid them
o Starvation (edge-triggered)
If there is a large amount of I/O space, it is possible that by trying to drain it the other files will not get processed causing starvation. (This problem is not specific to epoll.)
The solution is to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to remember which files need to be processed but still round robin amongst all the ready files. This also supports ignoring subsequent events you receive for file descriptors that are already ready.
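One way to picture the ready-list idea is a loop that visits every ready descriptor in turn and reads at most a fixed budget per visit, instead of draining one descriptor completely. This is only a sketch: the QUANTUM budget, the parallel ready[] array, and the function name serve_round_robin() are assumptions, and a real program would set ready[i] from epoll_wait(2) events.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define QUANTUM 1000   /* hypothetical per-visit byte budget */

/* Serve every ready descriptor in turn, at most QUANTUM bytes per visit.
   ready[i] != 0 marks fds[i] as ready; a descriptor leaves the ready
   list when read() reports EAGAIN, EOF or an error. order[] records
   which index delivered data on each successful read. Returns the
   number of entries written to order[]. Descriptors are normally
   expected to be nonblocking. */
size_t serve_round_robin(const int *fds, int *ready, size_t nfds,
                         size_t *order, size_t max_order)
{
    char buf[QUANTUM];
    size_t served = 0, still_ready;

    do {
        still_ready = 0;
        for (size_t i = 0; i < nfds; i++) {
            if (!ready[i])
                continue;
            ssize_t n = read(fds[i], buf, sizeof buf);
            if (n > 0) {
                if (served < max_order)
                    order[served++] = i;   /* a real program would process buf */
                still_ready++;
            } else if (n == -1 && errno == EINTR) {
                still_ready++;             /* retry on the next pass */
            } else {
                ready[i] = 0;              /* EAGAIN, EOF or error */
            }
        }
    } while (still_ready > 0);
    return served;
}
```

With one descriptor holding three budgets' worth of data and another holding one, the second descriptor is served right after the first visit to the busy one rather than after it has been fully drained.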
o If using an event cache...
If you use an event cache or store all the file descriptors returned from epoll_wait(2), then make sure to provide a way to mark its closure dynamically (i.e., caused by a previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in event #47 a condition causes event #13 to be closed. If you remove the structure and close(2) the file descriptor for event #13, then your event cache might still say there are events waiting for that file descriptor causing confusion.
One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark its associated data structure as removed and link it to a cleanup list. If you find another event for file descriptor 13 in your batch processing, you will discover the file descriptor had been previously removed and there will be no confusion.
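The mark-as-removed approach might look like the sketch below. It is an illustration under assumptions: struct conn, conn_retire(), and the rule that handling one connection closes its peer are all invented to force the situation within a batch of two events.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

struct conn {                    /* hypothetical per-descriptor record */
    int fd;
    int removed;                 /* set once the descriptor has been closed */
    struct conn *peer;           /* handling this conn closes its peer */
    struct conn *next_cleanup;
};

static struct conn *cleanup_list;

/* Deregister and close c, but keep the record until the batch ends. */
static void conn_retire(int epfd, struct conn *c)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = cleanup_list;
    cleanup_list = c;
}

/* Returns how many stale events were skipped in one batch, or -1 on
   setup failure (error paths are trimmed for brevity). */
int event_cache_demo(void)
{
    int pa[2], pb[2], skipped = 0;
    struct epoll_event ev = {0}, events[2];
    struct conn *a = calloc(1, sizeof *a), *b = calloc(1, sizeof *b);
    struct conn *survivor = NULL;
    int epfd = epoll_create1(0);

    if (!a || !b || epfd == -1 || pipe(pa) == -1 || pipe(pb) == -1)
        return -1;
    a->fd = pa[0]; a->peer = b;
    b->fd = pb[0]; b->peer = a;

    ev.events = EPOLLIN;
    ev.data.ptr = a;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, a->fd, &ev) == -1)
        return -1;
    ev.data.ptr = b;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, b->fd, &ev) == -1)
        return -1;
    if (write(pa[1], "x", 1) != 1 || write(pb[1], "x", 1) != 1)
        return -1;               /* both descriptors are now ready */

    int n = epoll_wait(epfd, events, 2, 0);
    for (int i = 0; i < n; i++) {
        struct conn *c = events[i].data.ptr;
        if (c->removed) {        /* stale event: fd closed earlier in this batch */
            skipped++;
            continue;
        }
        survivor = c;
        conn_retire(epfd, c->peer);
    }

    while (cleanup_list) {       /* batch finished: records can be freed */
        struct conn *c = cleanup_list;
        cleanup_list = c->next_cleanup;
        free(c);
    }
    if (survivor) {
        close(survivor->fd);
        free(survivor);
    }
    close(pa[1]);
    close(pb[1]);
    close(epfd);
    return skipped;
}
```

Deferring free() until the batch ends is the point: the second event still carries a pointer to the retired record, so the record has to stay valid long enough for the removed flag to be checked.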
VERSIONS
The epoll API was introduced in Linux kernel 2.5.44. Support was added to glibc in version 2.3.2.
CONFORMING TO
The epoll API is Linux-specific. Some other systems provide similar mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.
SEE ALSO
epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2), poll(2), select(2)
COLOPHON
This page is part of release 4.04 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at http://www.kernel.org/doc/man-pages/.
Linux 2015-04-19 EPOLL(7)