bison.dox 24 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449
  1. // Copyright (C) 2017 Internet Systems Consortium, Inc. ("ISC")
  2. //
  3. // This Source Code Form is subject to the terms of the Mozilla Public
  4. // License, v. 2.0. If a copy of the MPL was not distributed with this
  5. // file, You can obtain one at http://mozilla.org/MPL/2.0/.
  6. /**
  7. @page parser Flex/Bison Parsers
  8. @section parserIntro Parser background
  9. Kea's data format of choice is JSON (defined in https://tools.ietf.org/html/rfc7159), which is used
  10. in configuration files, in the command channel and also when communicating between the DHCP servers
  11. and the DHCP-DDNS component. It is almost certain to be used as the data format for any new
  12. features.
  13. Historically, Kea used the @ref isc::data::Element::fromJSON and @ref
  14. isc::data::Element::fromJSONFile methods to parse data expected to be in JSON syntax. This in-house
  15. parser was developed back in the early days of Kea when it was part of BIND 10. Its main advantages
  16. were that it didn't have any external dependencies and that it was already available in the source
  17. tree when Kea development started. On the other hand, it was very difficult to modify (several
  18. attempts to implement more robust comments had failed) and lacked a number of features. Also, it was
  19. a pure JSON parser, so accepted anything as long as the content was correct JSON. (This caused some
  20. problems: for example, the syntactic checks were conducted late in the parsing process, by which
  21. time some of the information, e.g. line numbers, was no longer available. To print meaningful error
  22. messages, the Kea team had to develop a way to store filename, line and column information.
  23. Unfortunately this gave rise to other problems such as data duplication.) The output from these
  24. parsers was a tree of @ref isc::data::Element objects using shared pointers. This part of the
  25. processing we can refer to as phase 1.
  26. The Element tree was then processed by set of dedicated parsers. Each
  27. parser was able to handle its own context, e.g. global, subnet list,
  28. subnet, pool etc. This step took the tree generated in phase 1, parsed
  29. it and generated an output configuration (e.g. @ref
  30. isc::dhcp::SrvConfig) or dynamic structures
  31. (e.g. isc::data::Host). During this stage, a large number of parser
  32. objects derived from DhcpConfigParser could be instantiated for each
  33. scope and instance of data (e.g. to parse 1000 host reservation
  34. entries a thousand dedicated parsers were created). For convenience,
  35. this step is called phase 2.
  36. Other issues with the old parsers are discussed here: @ref dhcpv6ConfigParserBison (this section is
  37. focused on DHCPv6, but the same issues affected DHCPv4 and D2) and here:
  38. http://kea.isc.org/wiki/SimpleParser.
  39. @section parserBisonIntro Flex/Bison Based Parser
  40. To solve the issue of phase 1 mentioned earlier, a new parser has been developed that is based on
  41. the "flex and "bison" tools. The following text uses DHCPv6 as an example, but the same principle
  42. applies to DHCPv4 and D2; CA will likely to follow. The new parser consists of two core elements
  43. with a wrapper around them. The following descriptions are slightly oversimplified in order to
  44. convey the intent; a more detailed description is available in subsequent sections.
  45. -# Flex lexical analyzer (src/bin/dhcp6/dhcp6_lexer.ll): this is essentially a set of
  46. regular expressions and C++ code that creates new tokens that represent whatever
  47. was just parsed. This lexical analyzer (lexer) will be called iteratively by bison until the whole
  48. input text is parsed or an error is encountered. For example, a snippet of the
  49. code might look like this:
  50. @code
  51. \"socket-type\" {
  52. return isc::dhcp::Dhcp6Parser::make_SOCKET_TYPE(driver.loc_);
  53. }
  54. @endcode
  55. This tells the flex that if encounters "socket-type" (quoted), then it should
  56. create a token SOCKET_TYPE and pass to it its current location (that's the
  57. file name, line and column numbers).
  58. -# Bison grammar (src/bin/dhcp6/dhcp6_parser.yy): the module that defines the syntax. Grammar and
  59. syntax are perhaps fancy words, but they simply define what is allowed and where. Bison grammar
  60. starts with a list of tokens. Those tokens are defined only by name ("here's the list of possible
  61. tokens that could appear"). What constitutes a token is actually defined in the lexer. The
  62. grammar define how the incoming tokens are expected to fall into their places together. Let's
  63. take an example of the following input text:
  64. @code
  65. {
  66. "Dhcp6":
  67. {
  68. "renew-timer": 100
  69. }
  70. }
  71. @endcode
  72. The lexer would generate the following sequence of tokens: LCURLY_BRACKET, DHCP6, COLON,
  73. LCURLY_BRACKET, RENEW_TIMER, COLON, INTEGER (a token with a value of 100), RCURLY_BRACKET,
  74. RCURLY_BRACKET, END. The bison grammar recognizes that the sequence forms a valid sentence and
  75. that there are no errors and act upon it. (Whereas if the left and right braces in the above
  76. example were exchanged, the bison module would identify the sequence as syntactically incorrect.)
  77. -# Parser context. As there is some information that needs to be passed between parser and lexer,
  78. @ref isc::dhcp::Parser6Context is a convenience wrapper around those two bundled together. It
  79. also works as a nice encapsulation, hiding all the flex/bison details underneath.
  80. @section parserBuild Building Flex/Bison Code
  81. The only input file used by flex is the .ll file and the only input file used by bison is the .yy
  82. file. When making changes to the lexer or parser, only those two files are edited. When processed,
  83. the two tools generate a number of .h, .hh and .cc files. The major ones have the same name as their
  84. .ll and .yy counterparts (e.g. dhcp6_lexer.cc, dhcp6_parser.cc and dhcp6_parser.h etc.), but a
  85. number of additional files are also created: location.hh, position.hh and stack.hh. Those are
  86. internal bison headers that are needed for compilation.
  87. To avoid the need for every user to have flex and bison installed, the output files are generated
  88. when the .ll or .yy files are altered and are stored in the Kea repository. To generate those files,
  89. issue the following sequence of commands from the top-level Kea directory:
  90. @code
  91. ./configure --enable-generate-parser
  92. cd src/bin/dhcp6
  93. make parser
  94. @endcode
  95. Strictly speaking, the comment "make parser" is not necessary. If you updated the .ll or .yy file,
  96. the regular "make" command should pick those changes up. However, since one source file generates
  97. multiple output files and you are likely to be using a multi-process build (by specifying the "-j"
  98. switch on the "make" command), there may be odd side effects: explicitly rebuilding the files
  99. manually by using "make parser" avoids any trouble.
  100. One problem brought on by use of flex/bison is tool version dependency. If one developer uses
  101. version A of those tools and another developer uses B, the files generated by the different version
  102. may be significantly different. This causes all sorts of problems, e.g. coverity/cpp-check issues
  103. may appear and disappear: in short, it can cause all sorts of general unhappiness. To avoid those
  104. problems, the Kea team generates the flex/bison files on a dedicated machine. See KeaRegen page
  105. on ISC internal wiki for details.
  106. @section parserFlex Flex Detailed
  107. Earlier sections described the lexer in a bit of an over-simplified way. The .ll file contains a
  108. number of elements in addition to the regular expressions and they're not as simple as was
  109. described.
  110. The file starts with a number of sections separated by percent (%) signs. Depending on which section
  111. code is written in, it may be interpreted by flex, copied verbatim to the output .cc file, copied to
  112. the output .h file or copied to both.
  113. There is an initial section that defines flex options. These are somewhat documented, but the
  114. documentation for it may be a bit cryptic. When developing new parsers, it's best to start by
  115. copying whatever we have for DHCPv6 and tweak as needed.
  116. Next comes the flex conditions. They are defined with %%x and they define a state of the lexer. A
  117. good example of a state may be comment. Once the lexer detects that a comment's beginning, it
  118. switches to a certain condition (by calling BEGIN(COMMENT) for example) and the code then ignores
  119. whatever follows (especially strings that look like valid tokens) until the comment is closed (when
  120. it returns to the default condition by calling BEGIN(INITIAL)). This is something that is not
  121. frequently used and the only use cases for it are the forementioned comments and file inclusions.
  122. After this come the syntactic contexts. Let's assume we have a parser that uses an "ip-address"
  123. regular expression (regexp) that would return the IP_ADDRESS token. Whenever we want to allow
  124. "ip-address", the grammar allows the IP_ADDRESS token to appear. When the lexer is called, it will
  125. match the regexp, generate the IP_ADDRESS token and the parser will carry out its duty. This works
  126. fine as long as you have very specific grammar that defines everything. Sadly, that's not the case
  127. in DHCP as we have hooks. Hook libraries can have parameters that are defined by third party
  128. developers and they can pick whatever parameter names they want, including "ip-address". Another
  129. example could be Dhcp4 and Dhcp6 configurations defined in a single file. The grammar defining
  130. "Dhcp6" main contain a clause that says "Dhcp4" may contain any generic JSON. However, the lexer may
  131. find the "ip-address" string in the "Dhcp4" configuration and will say that it's not a part of
  132. generic JSON, but a dedicated IP_ADDRESS token instead. The parser will then complain and the whole
  133. thing would end up in failure. It was to solve this problem that syntactic contexts were introduced.
  134. They tell the lexer whether input strings have specific or generic meaning. For example, when
  135. parsing host reservations, the lexer is expected to report the IP_ADDRESS token if "ip-address" is
  136. detected. However, when parsing generic JSON, upon encountering "ip-address" it should return a
  137. STRING with a value of "ip-address". The list of all contexts is enumerated in @ref
  138. isc::dhcp::Parser6Context::ParserContext.
  139. For a DHCPv6-specific description of the conflict avoidance, see @ref dhcp6ParserConflicts.
  140. @section parserGrammar Bison Grammar
  141. Bison has much better documentation than flex. Its latest version seems to be available here:
  142. https://www.gnu.org/software/bison/manual. Bison is a LALR(1) parser, which essentially means that
  143. it is able to parse (separate and analyze) any text that is described by set of rules. You can see
  144. the more formal description here: https://en.wikipedia.org/wiki/LALR_parser, but the plain English
  145. explanation is that you define a set of rules and bison will walk through input text trying to match
  146. the content to those rules. While doing so, it will be allowed to peek at most one symbol (token)
  147. ahead.
  148. As an example, let's take a closer look at the bison grammar we have for DHCPv6. It is defined
  149. in src/bin/dhcp6/dhcp6_parser.yy. Here's a simplified excerpt:
  150. @code
  151. // This defines a global Dhcp6 object.
  152. dhcp6_object: DHCP6 COLON LCURLY_BRACKET global_params RCURLY_BRACKET;
  153. // This defines all parameters that may appear in the Dhcp6 object.
  154. // It can either contain a global_param (defined below) or a
  155. // global_params list, followed by a comma followed by a global_param.
  156. // Note this definition is recursive and can expand to a single
  157. // instance of global_param or multiple instances separated by commas.
  158. // This is how bison handles variable number of parameters.
  159. global_params: global_param
  160. | global_params COMMA global_param
  161. ;
  162. // These are the parameters that are allowed in the top-level for
  163. // Dhcp6.
  164. global_param: preferred_lifetime
  165. | valid_lifetime
  166. | renew_timer
  167. | rebind_timer
  168. | subnet6_list
  169. | interfaces_config
  170. | lease_database
  171. | hosts_database
  172. | mac_sources
  173. | relay_supplied_options
  174. | host_reservation_identifiers
  175. | client_classes
  176. | option_data_list
  177. | hooks_libraries
  178. | expired_leases_processing
  179. | server_id
  180. | dhcp4o6_port
  181. ;
  182. renew_timer: RENEW_TIMER COLON INTEGER;
  183. // Many other definitions follow.
  184. @endcode
  185. The code above defines parameters that may appear in the Dhcp6 object declaration. One important
  186. trick to understand is understand the way to handle variable number of parameters. In bison it is
  187. most convenient to present them as recursive lists: in this example, global_params defined in a way
  188. that allows any number of global_param instances allowing the grammar to be easily extensible. If
  189. one needs to add a new global parameter, just add it to the global_param list.
  190. This type of definition has several levels, each representing logical structure of the configuration
  191. data. We start with global scope, then step into a Dhcp6 object that has a Subnet6 list, which in
  192. turn has Subnet6 instances, each of which has pools list and so on. Each level is represented as a
  193. separate rule.
  194. The "leaf" rules (that don't contain any other rules) must be defined by a series of tokens. An
  195. example of such a rule is renew_timer, above. It is defined as a series of 3 tokens: RENEW_TIMER,
  196. COLON and INTEGER.
  197. Speaking of integers, it is worth noting that some tokens can have values. Those values are defined
  198. using %token clause. For example, dhcp6_parser.yy contains the following:
  199. @code
  200. %token <std::string> STRING "constant string"
  201. %token <int64_t> INTEGER "integer"
  202. %token <double> FLOAT "floating point"
  203. %token <bool> BOOLEAN "boolean"
  204. @endcode
  205. The first line says that the token STRING has a type of std::string and when referring to this token
  206. in error messages, it should be printed as "constant string".
  207. In principle, it is valid to define just the grammar without any corresponding C++ code to it. Bison
  208. will go through the whole input text, match the rules and will either say the input adhered to the
  209. rules (parsing successful) or not (parsing failed). This may be a useful step when developing new
  210. parser, but it has no practical value. To perform specific actions, bison allows the injection of
  211. C++ code at almost any point. For example we could augment the parsing of renew_timer with some
  212. extra code:
  213. @code
  214. renew_timer: RENEW_TIMER {
  215. cout << "renew-timer token detected, so far so good" << endl;
  216. } COLON {
  217. cout << "colon detected!" << endl;
  218. } INTEGER {
  219. uint32_t timer = $3;
  220. cout << "Got the renew-timer value: " << timer << endl;
  221. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3)));
  222. ctx.stack_.back()->set("renew-timer", prf);
  223. };
  224. @endcode
  225. This example showcases several important things. First, the ability to insert code at almost any
  226. step is very useful. It's also a powerful debugging tool.
  227. Second, some tokens are valueless (e.g. "renew-timer" when represented as the RENEW_TIMER token has
  228. no value), but some have values. In particular, the INTEGER token has value which can be extracted
  229. by $ followed by a number that represents its order, so $3 means "a value of third token or action
  230. in this rule". If needed, the location of specific token (filename, line and column) can be
  231. accessed with @ followed by a number that represents token number, e.g. @3 in the example above
  232. returns exact location of INTEGER token.
  233. Also, some rules may have values. This is not used often, but there are specific cases when it's
  234. convenient. Let's take a look at the following excerpt from dhcp6_parser.yy:
  235. @code
  236. ncr_protocol: NCR_PROTOCOL {
  237. ctx.enter(ctx.NCR_PROTOCOL); (1)
  238. } COLON ncr_protocol_value {
  239. ctx.stack_.back()->set("ncr-protocol", $4); (3)
  240. ctx.leave(); (4)
  241. };
  242. ncr_protocol_value:
  243. UDP { $$ = ElementPtr(new StringElement("UDP", ctx.loc2pos(@1))); }
  244. | TCP { $$ = ElementPtr(new StringElement("TCP", ctx.loc2pos(@1))); } (2)
  245. ;
  246. @endcode
  247. (The numbers in brackets at the end of some lines do not appear in the code; they are used identify
  248. the statements in the following discussion.)
  249. The "ncr-protocol" parameter accepts one of two values: either tcp or udp. To handle such a case, we
  250. first enter the NCR_PROTOCOL context to tell the lexer that we're in this scope. The lexer will then
  251. know that any incoming string of text that is either "UDP" or "TCP" should be represented as one of
  252. the TCP or UDP tokens. The parser knows that after NCR_PROTOCOL there will be a colon followed by an
  253. ncr_protocol_value. The rule for ncr_protocol_value says it can be either the TCP token or the UDP
  254. token. Let's assume the input text is:
  255. @code
  256. "ncr-protocol": "TCP"
  257. @endcode
  258. Here's how the parser will handle it. First, it will attempt to match the rule for ncr_protocol. It
  259. will discover the first token is NCR_PROTOCOL. As a result, it will run the code (1), which will
  260. tell lexer to parse incoming tokens as ncr protocol values. The next token is expected to be COLON
  261. and the one after that the ncr_protocol_value. The lexer has already been switched into the
  262. NCR_PROTOCOL context, so it will recognize "TCP" as TCP token, not as a string with a value of
  263. "TCP". The parser will receive that token and match the line (2), which creates an appropriate
  264. representation that will be used as the rule's value ($$). Finally, the parser will unroll back to
  265. ncr_protocol rule and execute the code in lines (3) and (4). Line (3) picks the value set up in
  266. line (2) and adds it to the stack of values. Finally, line (4) tells the lexer that we finished the
  267. NCR protocol parsing and it can go back to whatever state it was before.
  268. @section parserBisonStack Generating the Element Tree in Bison
  269. The bison parser keeps matching rules until it reaches the end of input file. During that process,
  270. the code needs to build a hierarchy (a tree) of inter-connected Element objects that represents the
  271. parsed text. @ref isc::data::Element has a complex structure that defines parent-child relation
  272. differently depending on the type of parent (ae.g. a map and a list refer to their children in
  273. different ways). This requires the code to be aware of the parent content. In general, every time a
  274. new scope (an opening curly bracket in input text) is encountered, the code pushes new Element to
  275. the stack (see @ref isc::dhcp::Parser6Context::stack_) and every time the scope closes (a closing
  276. curly bracket in input text) the element is removed from the stack. With this approach, we always
  277. have access to the parent element as it's the last element on the stack. For example, when parsing
  278. preferred-lifetime, the code does the following:
  279. @code
  280. preferred_lifetime: PREFERRED_LIFETIME COLON INTEGER {
  281. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3)));
  282. ctx.stack_.back()->set("preferred-lifetime", prf);
  283. }
  284. @endcode
  285. The first line creates an instance of IntElement with a value of the token. The second line adds it
  286. to the current map (current = the last on the stack). This approach has a very nice property of
  287. being generic. This rule can be referenced from both global and subnet scope (and possibly other
  288. scopes as well) and the code will add the IntElement object to whatever is last on the stack, be it
  289. global, subnet or perhaps even something else (maybe one day we will allow preferred lifetime to be
  290. defined on a per pool or per host basis?).
  291. @section parserSubgrammar Parsing a Partial Configuration
  292. All the explanations so far assumed that we're operating in a default case of receiving the
  293. configuration as a whole. That is the case during startup and reconfiguration. However, both DHCPv4
  294. and DHCPv6 support certain cases when the input text is not the whole configuration, but rather
  295. certain parts of it. There are several examples of such cases. The most common are unit-tests. They
  296. typically don't have the outermost { } or Dhcp6 object, but simply define whatever parameters are
  297. being tested. Second, we have the command channel that will, in the near future, contain parts of
  298. the configuration, depending on the command. For example, "add-reservation" will contain a host
  299. reservation only.
  300. Bison by default does not support multiple start rules, but there's a trick that can provide such a
  301. capability. The trick assumes that the starting rule may allow one of the artificial tokens that
  302. represent the scope expected. For example, when called from the "add-reservation" command, the
  303. artificial token will be SUB_RESERVATION and it will trigger the parser to bypass the global braces
  304. { and } and the "Dhcp6" token and jump immediately to the sub_reservation.
  305. This trick is also implemented in the lexer. A flag called start_token_flag, when initially set to
  306. true, will cause the lexer to emit an artificial token once, before parsing any input whatsoever.
  307. This optional feature can be skipped altogether if you don't plan to parse parts of the
  308. configuration.
  309. @section parserBisonExtend Extending the Grammar
  310. Adding new parameters to existing parsers is very easy once you get hold of the concept of what the
  311. grammar rules represent. The first step is to understand where the parameter is to be
  312. allowed. Typically a new parameter is allowed in one scope and only over time is it added to other
  313. scopes. Recently support for a 4o6-interface-id parameter has been added. That is a parameter that
  314. can be defined in a subnet and takes a string argument. You can see the actual change conducted in
  315. this commit: (https://github.com/isc-projects/kea/commit/9fccdbf54c4611dc10111ad8ff96d36cad59e1d6).
  316. Here's the complete set of changes that were necessary.
  317. 1. Define a new token in dhcp6_parser.yy:
  318. @code
  319. SUBNET_4O6_INTERFACE_ID "4o6-interface-id"
  320. @endcode
  321. This defines a token called SUBNET_4O6_INTERFACE_ID that, when it needs to
  322. be printed, e.g. in an error message, will be represented as "4o6-interface-id".
  323. 2. Tell the lexer how to recognize the new parameter:
  324. @code
  325. \"4o6-interface-id\" {
  326. switch(driver.ctx_) {
  327. case isc::dhcp::Parser4Context::SUBNET4:
  328. return isc::dhcp::Dhcp4Parser::make_SUBNET_4O6_INTERFACE_ID(driver.loc_);
  329. default:
  330. return isc::dhcp::Dhcp4Parser::make_STRING("4o6-interface-id", driver.loc_);
  331. }
  332. }
  333. @endcode
  334. It tells the parser that when in Subnet4 context, an incoming "4o6-interface-id" string should be
  335. represented as the SUBNET_4O6_INTERFACE_ID token. In any other context, it should be represented
  336. as a string.
  337. 3. Add the rule that will define the value. A user is expected to add something like
  338. @code
  339. "4o6-interface-id": "whatever"
  340. @endcode
  341. The rule to match this and similar statements looks as follows:
  342. @code
  343. subnet_4o6_interface_id: SUBNET_4O6_INTERFACE_ID {
  344. ctx.enter(ctx.NO_KEYWORD);
  345. } COLON STRING {
  346. ElementPtr iface(new StringElement($4, ctx.loc2pos(@4)));
  347. ctx.stack_.back()->set("4o6-interface-id", iface);
  348. ctx.leave();
  349. };
  350. @endcode
  351. Here's a good example of the context use. We have no idea what sort of interface-id the user will
  352. use. Typically that will be an integer, but it may be something weird that happens to match our
  353. reserved keywords. Therefore we switch to no keyword context. This tells the lexer to interpret
  354. everything as string, integer or float.
  355. 4. Finally, extend the existing subnet4_param that defines all allowed parameters
  356. in the Subnet4 scope to also cover our new parameter (the new line marked with *):
  357. @code
  358. subnet4_param: valid_lifetime
  359. | renew_timer
  360. | rebind_timer
  361. | option_data_list
  362. | pools_list
  363. | subnet
  364. | interface
  365. | interface_id
  366. | id
  367. | rapid_commit
  368. | client_class
  369. | reservations
  370. | reservation_mode
  371. | relay
  372. | match_client_id
  373. | next_server
  374. | subnet_4o6_interface
  375. | subnet_4o6_interface_id (*)
  376. | subnet_4o6_subnet
  377. | unknown_map_entry
  378. ;
  379. @endcode
  380. 5. Regenerate the flex/bison files by typing "make parser".
  381. 6. Run the unit-tests that you wrote before you touched any of the bison stuff. You did write them
  382. in advance, right?
  383. */