@@ -9,14 +9,14 @@

@section parserIntro Parser background

-Kea's format of choice is JSON, which is used in configuration files, in the
-command channel and also when communicating between DHCP servers and DHCP-DDNS
-component. It is almost certain that it will be used as the syntax for any
-upcoming features.
+Kea's data format of choice is JSON (https://tools.ietf.org/html/rfc7159), which
+is used in configuration files, in the command channel and also when
+communicating between DHCP servers and the DHCP-DDNS component. It is almost
+certain that it will be used as the data format for any new features.

Historically, Kea used @ref isc::data::Element::fromJSON and @ref
isc::data::Element::fromJSONFile methods to parse received data that is expected
-to be in JSON syntax. This in-house parser was developed back in early BIND10
+to be in JSON syntax. This in-house parser was developed back in the early BIND10
days. Its two main advantages were that it didn't have any external dependencies
and that it was already available in the source tree when the Kea project
started. On the other hand, it was very difficult to modify (several attempts to
@@ -49,9 +49,9 @@ and here: http://kea.isc.org/wiki/SimpleParser.
To solve the issue of phase 1 mentioned earlier, a new parser has been developed
that is based on flex and bison tools. The following text uses DHCPv6 as an
example, but the same principle applies to DHCPv4 and D2, and CA will likely
-follow. The new parser consists of two core elements (the following description
-is slightly oversimplified to convey the intent, more detailed description
-is available in the following sections):
+follow. The new parser consists of two core elements with a wrapper around them
+(the following description is slightly oversimplified to convey the intent; a
+more detailed description is available in the following sections):

-# Flex lexer (src/bin/dhcp6/dhcp6_lexer.ll) that is essentially a set of
regular expressions with C++ code that creates new tokens that represent whatever
@@ -87,20 +87,23 @@ is available in the following sections):
(a token with a value of 100), RCURLY_BRACKET, RCURLY_BRACKET, END

-# Parser context. As there is some information that needs to be passed between
- parser and lexer, @ref isc::dhcp::Parser6Context is a convenient to wrapper
+ parser and lexer, @ref isc::dhcp::Parser6Context is a convenience wrapper
around those two bundled together. It also works as a nice encapsulation,
hiding all the flex/bison details underneath.

@section parserBuild Building flex/bison code

-The only input file used by flex is the .ll file. The only input file used
-by bison is the .yy file. When processed, those two tools will generate
-a number of .hh and .cc files. The major ones are names the same as their
-.ll and .yy counterparts (e.g. dhcp6_lexer.cc, dhcp6_parser.cc and dhcp6_parser.h),
-but there's a number of additional files created: location.hh, position.hh
-and stack.hh. Those are internal bison headers that are needed. To avoid every
-user to have flex and bison installed, we chose to generate the files and
-add them to the Kea repository. To generate those files, do the following:
+The only input file used by flex is the .ll file. The only input file used by
+bison is the .yy file. When making changes to the lexer or parser, only those
+two files are edited. When processed, those two tools will generate a number of
+.hh and .cc files. The major ones are named the same as their .ll and .yy
+counterparts (e.g. dhcp6_lexer.cc, dhcp6_parser.cc and dhcp6_parser.h), but
+there are a number of additional files created: location.hh, position.hh and
+stack.hh. Those are internal bison headers that are needed for compilation.
+
+To avoid requiring every user to have flex and bison installed, we chose to
+generate the files and add them to the Kea repository. To generate those files,
+do the following:

@code
./configure --enable-generate-parser
@@ -120,7 +123,9 @@ generated may be different and cause unnecessarily large diffs, may cause
coverity/cpp-check issues appear and disappear and cause general unhappiness.
To avoid those problems, we will introduce a requirement to generate flex/bison
files on one dedicated machine. This machine will likely be docs. Currently Ops
-is working on installing the necessary versions of flex/bison required
+is working on installing the necessary versions of flex/bison required, but
+for the time being we can use the versions installed in Francis' home directory
+(export PATH=/home/fdupont/bin:$PATH).

Note: the above applies only to the code being merged on master. It is probably
ok to generate the files on your development branch with whatever version you
@@ -145,10 +150,10 @@ documented, but the docs for it may be a bit cryptic. When developing new
parsers, it's best to start by copying whatever we have for DHCPv6 and tweak as
needed.

-Second addition are flex conditions. They're defined with %x and they define a
+The second addition is flex conditions. They're defined with %%x and they define a
state of the lexer. A good example of a state may be comment. Once the lexer
-detects that a comment has started, it switches to certain condition (by calling
-BEGIN(COMMENT) for example) and the code should ignore whatever follows
+detects that a comment has begun, it switches to a certain condition (by calling
+BEGIN(COMMENT) for example) and the code then ignores whatever follows
(especially strings that look like valid tokens) until the comment is closed
(when it returns to the default condition by calling BEGIN(INITIAL)). This is
something that is not frequently used and the only use cases for it are the
@@ -157,7 +162,7 @@ forementioned comments and file inclusions.
The third addition is parser contexts. Let's assume we have a parser that uses
"ip-address" regexp that would return IP_ADDRESS token. Whenever we want to
allow "ip-address", the grammar allows IP_ADDRESS token to appear. When the
-lexer is called, it will match the regexp, will generate IP_ADDRESS token and
+lexer is called, it will match the regexp, will generate the IP_ADDRESS token and
the parser will carry out its duty. This works fine as long as you have very
specific grammar that defines everything. Sadly, that's not the case in DHCP as
we have hooks. Hook libraries can have parameters that are defined by third
@@ -193,7 +198,7 @@ in src/bin/dhcp6/dhcp6_parser.yy. Here's a simplified excerpt of it:
dhcp6_object: DHCP6 COLON LCURLY_BRACKET global_params RCURLY_BRACKET;

// This defines all parameters that may appear in the Dhcp6 object.
-// It can either contain a global_param (defined below) or a 
+// It can either contain a global_param (defined below) or a
// global_params list, followed by a comma followed by a global_param.
// Note this definition is recursive and can expand to a single
// instance of global_param or multiple instances separated by commas.
@@ -201,7 +206,7 @@ dhcp6_object: DHCP6 COLON LCURLY_BRACKET global_params RCURLY_BRACKET;
global_params: global_param
| global_params COMMA global_param
;
-
+
// These are the parameters that are allowed in the top-level for
// Dhcp6.
global_param: preferred_lifetime
@@ -222,9 +227,9 @@ global_param: preferred_lifetime
| server_id
| dhcp4o6_port
;
-
+
renew_timer: RENEW_TIMER COLON INTEGER;
-
+
// Many other definitions follow.
@endcode

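To make the recursion concrete, here is a hypothetical derivation (an illustration, not part of the grammar file) showing how the global_params rule matches a three-element list:

@code
// Deriving "A, B, C" with the recursive rule, innermost match last:
global_params
  -> global_params COMMA global_param                      // "..., C"
  -> global_params COMMA global_param COMMA global_param   // "..., B, C"
  -> global_param COMMA global_param COMMA global_param    // "A, B, C"
@endcode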
@@ -244,7 +249,7 @@ rule.

The "leaf" rules, which don't contain any other rules, must be defined by a
series of tokens. An example of such a rule is renew_timer above. It is defined
-as a series of 3 tokens: RENEW_TIMER, COLON and INTEGER. 
+as a series of 3 tokens: RENEW_TIMER, COLON and INTEGER.

Speaking of integers, it is worth noting that some tokens can have values. Those
values are defined using %token clause. For example, dhcp6_parser.yy has the
|
@@ -272,7 +277,7 @@ renew_timer with some extra code:
|
|
|
@code
|
|
|
renew_timer: RENEW_TIMER {
|
|
|
cout << "renew-timer token detected, so far so good" << endl;
|
|
|
-} COLON {
|
|
|
+} COLON {
|
|
|
cout << "colon detected!" << endl;
|
|
|
} INTEGER {
|
|
|
uint32_t timer = $3;
|
|
@@ -298,11 +303,11 @@ ncr_protocol: NCR_PROTOCOL {
ctx.enter(ctx.NCR_PROTOCOL); (1)
} COLON ncr_protocol_value {
ctx.stack_.back()->set("ncr-protocol", $4); (3)
- ctx.leave();
+ ctx.leave(); (4)
};

ncr_protocol_value:
- UDP { $$ = ElementPtr(new StringElement("UDP", ctx.loc2pos(@1))); } 
+ UDP { $$ = ElementPtr(new StringElement("UDP", ctx.loc2pos(@1))); }
| TCP { $$ = ElementPtr(new StringElement("TCP", ctx.loc2pos(@1))); } (2)
;
@endcode
@@ -358,8 +363,8 @@ The first line creates an instance of IntElement with a value of the token. The
second line adds it to the current map (current = the last on the stack). This
approach has a very nice property of being generic. This rule can be referenced
from global and subnet scope (and possibly other scopes as well) and the code
-will add the IntElement object to whatever is last on the stack, be it
-global, subnet or perhaps even something else (maybe we will allow preferred
+will add the IntElement object to whatever is last on the stack, be it global,
+subnet or perhaps even something else (maybe one day we will allow preferred
lifetime to be defined on a per pool or per host basis?).

@section parserSubgrammar Parsing partial grammar
@@ -385,6 +390,9 @@ This trick is also implemented in the lexer. There's a flag called start_token_f
When initially set to true, it will cause the lexer to emit an artificial
token once, before parsing any input whatsoever.

+This optional feature can be skipped altogether if you don't plan to parse parts
+of the configuration.
+
@section parserBisonExtend Extending grammar

Adding new parameters to existing parsers is very easy once you get hold of the
@@ -402,7 +410,7 @@ Here's the complete set of necessary changes.
@code
SUBNET_4O6_INTERFACE_ID "4o6-interface-id"
@endcode
- This defines a token called SUBNET_4O6_INTERFACE_ID that, when needed to 
+ This defines a token called SUBNET_4O6_INTERFACE_ID that, when needed to
be printed, will be represented as "4o6-interface-id".

2. Tell lexer how to recognize the new parameter:
@@ -439,7 +447,7 @@ Here's the complete set of necessary changes.
weird that happens to match our reserved keywords. Therefore we switch to
no keyword context. This tells the lexer to interpret everything as string,
integer or float.
-
+
4. Finally, extend the existing subnet4_param that defines all allowed parameters
in Subnet4 scope to also cover our new parameter (the new line marked with *):
@code