--- Relevant portions of RFC2616 --- OCTET = CHAR = UPALPHA = LOALPHA = ALPHA = UPALPHA | LOALPHA DIGIT = CTL = CR = LF = SP = HT = <"> = CRLF = CR LF LWS = [CRLF] 1*( SP | HT ) TEXT = HEX = "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f" | DIGIT separators = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT token = 1* quoted-pair = "\" CHAR ctext = qdtext = > quoted-string = ( <"> *(qdtext | quoted-pair ) <"> ) comment = "(" *( ctext | quoted-pair | comment ) ")" 4 HTTP Message 4.1 Message Types HTTP messages consist of requests from client to server and responses from server to client. Request (section 5) and Response (section 6) messages use the generic message format of RFC 822 [9] for transferring entities (the payload of the message). Both types of message consist of : - a start-line - zero or more header fields (also known as "headers") - an empty line (i.e., a line with nothing preceding the CRLF) indicating the end of the header fields - and possibly a message-body. HTTP-message = Request | Response start-line = Request-Line | Status-Line generic-message = start-line *(message-header CRLF) CRLF [ message-body ] In the interest of robustness, servers SHOULD ignore any empty line(s) received where a Request-Line is expected. In other words, if the server is reading the protocol stream at the beginning of a message and receives a CRLF first, it should ignore the CRLF. 4.2 Message headers - Each header field consists of a name followed by a colon (":") and the field value. - Field names are case-insensitive. - The field value MAY be preceded by any amount of LWS, though a single SP is preferred. - Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT. message-header = field-name ":" [ field-value ] field-name = token field-value = *( field-content | LWS ) field-content = The field-content does not include any leading or trailing LWS occurring before the first non-whitespace character of the field-value or after the last non-whitespace character of the field-value. Such leading or trailing LWS MAY be removed without changing the semantics of the field value. Any LWS that occurs between field-content MAY be replaced with a single SP before interpreting the field value or forwarding the message downstream. => format des headers = 1*(CHAR & !ctl & !sep) ":" *(OCTET & (!ctl | LWS)) => les regex de matching de headers s'appliquent sur field-content, et peuvent utiliser field-value comme espace de travail (mais de préférence après le premier SP). (19.3) The line terminator for message-header fields is the sequence CRLF. However, we recommend that applications, when parsing such headers, recognize a single LF as a line terminator and ignore the leading CR. message-body = entity-body | 5 Request Request = Request-Line *(( general-header | request-header | entity-header ) CRLF) CRLF [ message-body ] 5.1 Request line The elements are separated by SP characters. No CR or LF is allowed except in the final CRLF sequence. Request-Line = Method SP Request-URI SP HTTP-Version CRLF (19.3) Clients SHOULD be tolerant in parsing the Status-Line and servers tolerant when parsing the Request-Line. In particular, they SHOULD accept any amount of SP or HT characters between fields, even though only a single SP is required. 4.5 General headers Apply to MESSAGE. general-header = Cache-Control | Connection | Date | Pragma | Trailer | Transfer-Encoding | Upgrade | Via | Warning General-header field names can be extended reliably only in combination with a change in the protocol version. However, new or experimental header fields may be given the semantics of general header fields if all parties in the communication recognize them to be general-header fields. Unrecognized header fields are treated as entity-header fields. 5.3 Request Header Fields The request-header fields allow the client to pass additional information about the request, and about the client itself, to the server. These fields act as request modifiers, with semantics equivalent to the parameters on a programming language method invocation. request-header = Accept | Accept-Charset | Accept-Encoding | Accept-Language | Authorization | Expect | From | Host | If-Match | If-Modified-Since | If-None-Match | If-Range | If-Unmodified-Since | Max-Forwards | Proxy-Authorization | Range | Referer | TE | User-Agent Request-header field names can be extended reliably only in combination with a change in the protocol version. However, new or experimental header fields MAY be given the semantics of request-header fields if all parties in the communication recognize them to be request-header fields. Unrecognized header fields are treated as entity-header fields. 7.1 Entity header fields Entity-header fields define metainformation about the entity-body or, if no body is present, about the resource identified by the request. Some of this metainformation is OPTIONAL; some might be REQUIRED by portions of this specification. entity-header = Allow | Content-Encoding | Content-Language | Content-Length | Content-Location | Content-MD5 | Content-Range | Content-Type | Expires | Last-Modified | extension-header extension-header = message-header The extension-header mechanism allows additional entity-header fields to be defined without changing the protocol, but these fields cannot be assumed to be recognizable by the recipient. Unrecognized header fields SHOULD be ignored by the recipient and MUST be forwarded by transparent proxies. ---------------------------------- The format of Request-URI is defined by RFC3986 : URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty URI-reference = URI / relative-ref absolute-URI = scheme ":" hier-part [ "?" query ] relative-ref = relative-part [ "?" query ] [ "#" fragment ] relative-part = "//" authority path-abempty / path-absolute / path-noscheme / path-empty scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) authority = [ userinfo "@" ] host [ ":" port ] userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) host = IP-literal / IPv4address / reg-name port = *DIGIT IP-literal = "[" ( IPv6address / IPvFuture ) "]" IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) IPv6address = 6( h16 ":" ) ls32 / "::" 5( h16 ":" ) ls32 / [ h16 ] "::" 4( h16 ":" ) ls32 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 / [ *4( h16 ":" ) h16 ] "::" ls32 / [ *5( h16 ":" ) h16 ] "::" h16 / [ *6( h16 ":" ) h16 ] "::" h16 = 1*4HEXDIG ls32 = ( h16 ":" h16 ) / IPv4address IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet dec-octet = DIGIT ; 0-9 / %x31-39 DIGIT ; 10-99 / "1" 2DIGIT ; 100-199 / "2" %x30-34 DIGIT ; 200-249 / "25" %x30-35 ; 250-255 reg-name = *( unreserved / pct-encoded / sub-delims ) path = path-abempty ; begins with "/" or is empty / path-absolute ; begins with "/" but not "//" / path-noscheme ; begins with a non-colon segment / path-rootless ; begins with a segment / path-empty ; zero characters path-abempty = *( "/" segment ) path-absolute = "/" [ segment-nz *( "/" segment ) ] path-noscheme = segment-nz-nc *( "/" segment ) path-rootless = segment-nz *( "/" segment ) path-empty = 0 segment = *pchar segment-nz = 1*pchar segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) ; non-zero-length segment without any colon ":" pchar = unreserved / pct-encoded / sub-delims / ":" / "@" query = *( pchar / "/" / "?" ) fragment = *( pchar / "/" / "?" ) pct-encoded = "%" HEXDIG HEXDIG unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" reserved = gen-delims / sub-delims gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" => so the list of allowed characters in a URI is : uri-char = unreserved / gen-delims / sub-delims / "%" = ALPHA / DIGIT / "-" / "." / "_" / "~" / ":" / "/" / "?" / "#" / "[" / "]" / "@" / "!" / "$" / "&" / "'" / "(" / ")" / / "*" / "+" / "," / ";" / "=" / "%" Note that non-ascii characters are forbidden ! Spaces and CTL are forbidden. Unfortunately, some products such as Apache allow such characters :-/ ---- The correct way to do it ---- - one http_session It is basically any transport session on which we talk HTTP. It may be TCP, SSL over TCP, etc... It knows a way to talk to the client, either the socket file descriptor or a direct access to the client-side buffer. It should hold information about the last accessed server so that we can guarantee that the same server can be used during a whole session if needed. A first version without optimal support for HTTP pipelining will have the client buffers tied to the http_session. It may be possible that it is not sufficient for full pipelining, but this will need further study. The link from the buffers to the backend should be managed by the http transaction (http_txn), provided that they are serialized. Each http_session, has 0 to N http_txn. Each http_txn belongs to one and only one http_session. - each http_txn has 1 request message (http_req), and 0 or 1 response message (http_rtr). Each of them has 1 and only one http_txn. An http_txn holds informations such as the HTTP method, the URI, the HTTP version, the transfer-encoding, the HTTP status, the authorization, the req and rtr content-length, the timers, logs, etc... The backend and server which process the request are also known from the http_txn. - both request and response messages hold header and parsing informations, such as the parsing state, start of headers, start of message, captures, etc...