bison.dox 23 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484
  1. // Copyright (C) 2017 Internet Systems Consortium, Inc. ("ISC")
  2. //
  3. // This Source Code Form is subject to the terms of the Mozilla Public
  4. // License, v. 2.0. If a copy of the MPL was not distributed with this
  5. // file, You can obtain one at http://mozilla.org/MPL/2.0/.
  6. /**
  7. @page parser Flex/Bison Parsers
  8. @section parserIntro Parser background
  9. Kea's data format of choice is JSON (defined in https://tools.ietf.org/html/rfc7159), which
  10. is used in configuration files, in the command channel and also when
  11. communicating between the DHCP servers and the DHCP-DDNS component. It is almost certain
  12. to be used as the data format for any new features.
  13. Historically, Kea used the @ref isc::data::Element::fromJSON and @ref
  14. isc::data::Element::fromJSONFile methods to parse data expected
  15. to be in JSON syntax. This in-house parser was developed back in the early days of
  16. Kea when it was part of BIND 10.
  17. Its main advantages were that it didn't have any external dependencies
  18. and that it was already available in the source tree when Kea development
  19. started. On the other hand, it was very difficult to modify (several attempts to
  20. implement more robust comments had failed) and lacked a number of features. Also, it
  21. was a pure JSON parser, so accepted anything as long as the content was correct
  22. JSON. (This caused some problems: for example, the syntactic checks were conducted late in the
  23. parsing process, by which time some of the information, e.g. line numbers, was no longer
  24. available. To print meaningful error messages, the Kea team had to develop a
  25. way to store filename, line and column information. Unfortunately this gave rise to other problems
  26. such as data duplication.) The output from these parsers was a tree of @ref
  27. isc::data::Element objects using shared pointers. This part of the processing we
  28. can refer to as phase 1.
  29. The Element tree was then processed by set of dedicated parsers. Each parser
  30. was able to handle its own context, e.g. global, subnet list, subnet, pool
  31. etc. This step took the tree generated in phase 1, parsed it and
  32. generateda an output configuration (e.g. @ref isc::dhcp::SrvConfig) or dynamic
  33. structures (e.g. isc::data::Host). During this stage, a large number of parser objects
  34. derived from @ref isc::dhcp::DhcpConfigParser could be instantiated for each scope and
  35. instance of data (e.g. to parse 1000 host reservation entries a thousand
  36. dedicated parsers were created). For convenience, this step is called phase 2.
  37. Other issues with the old parsers are discussed here: @ref dhcpv6ConfigParserBison
  38. (this section is focused on DHCPv6, but the same issues affected DHCPv4 and D2)
  39. and here: http://kea.isc.org/wiki/SimpleParser.
  40. @section parserBisonIntro Flex/Bison Based Parser
  41. To solve the issue of phase 1 mentioned earlier, a new parser has been developed
  42. that is based on the "flex and "bison" tools. The following text uses DHCPv6 as an
  43. example, but the same principle applies to DHCPv4 and D2; CA will likely to
  44. follow. The new parser consists of two core elements with a wrapper around them.
  45. The following descriptions are slightly oversimplified in order to convey the intent;
  46. a more detailed description is available in subsequent sections.
  47. -# Flex lexical analyzer (src/bin/dhcp6/dhcp6_lexer.ll): this is essentially a set of
  48. regular expressions and C++ code that creates new tokens that represent whatever
  49. was just parsed. This lexical analyzer (lexer) will be called iteratively by bison until the whole
  50. input text is parsed or an error is encountered. For example, a snippet of the
  51. code might look like this:
  52. @code
  53. \"socket-type\" {
  54. return isc::dhcp::Dhcp6Parser::make_SOCKET_TYPE(driver.loc_);
  55. }
  56. @endcode
  57. This tells the flex that if encounters "socket-type" (quoted), then it should
  58. create a token SOCKET_TYPE and pass to it its current location (that's the
  59. file name, line and column numbers).
  60. -# Bison grammar (src/bin/dhcp6/dhcp6_parser.yy): the module that defines the syntax.
  61. Grammar and syntax are perhaps fancy words, but they simply define what is
  62. allowed and where. Bison grammar starts with a list of tokens. Those tokens
  63. are defined only by name ("here's the list of possible tokens that could
  64. appear"). What constitutes a token is actually defined in the lexer. The
  65. grammar define how the incoming tokens are expected to fall into their
  66. places together. Let's take an example of the following input text:
  67. @code
  68. {
  69. "Dhcp6":
  70. {
  71. "renew-timer": 100
  72. }
  73. }
  74. @endcode
  75. The lexer would generate the following sequence of tokens: LCURLY_BRACKET,
  76. DHCP6, COLON, LCURLY_BRACKET, RENEW_TIMER, COLON, INTEGER
  77. (a token with a value of 100), RCURLY_BRACKET, RCURLY_BRACKET, END. The
  78. bison grammar recognises that the sequence forms a valid sentence and that
  79. there are no errors and act upon it. (Whereas if the left and
  80. right braces in the above example were exchanged, the bison
  81. module would identify the sequence as syntactically incorrect.)
  82. -# Parser context. As there is some information that needs to be passed between
  83. parser and lexer, @ref isc::dhcp::Parser6Context is a convenience wrapper
  84. around those two bundled together. It also works as a nice encapsulation,
  85. hiding all the flex/bison details underneath.
  86. @section parserBuild Building Flex/Bison Code
  87. The only input file used by flex is the .ll file and the only input file used by
  88. bison is the .yy file. When making changes to the lexer or parser, only those
  89. two files are edited. When processed, the two tools generate a number of
  90. .h, .hh and .cc files. The major ones have the same name as their .ll and .yy
  91. counterparts (e.g. dhcp6_lexer.cc, dhcp6_parser.cc and dhcp6_parser.h etc.), but
  92. a number of additional files are also created: location.hh, position.hh and
  93. stack.hh. Those are internal bison headers that are needed for compilation.
  94. To avoid the need for every user to have flex and bison installed, the output files
  95. are generated when the .ll or .yy files are altered and are stored in the
  96. Kea repository. To generate those files, issue the following sequence of
  97. commands from the top-level Kea directory:
  98. @code
  99. ./configure --enable-generate-parser
  100. cd src/bin/dhcp6
  101. make parser
  102. @endcode
  103. Strictly speaking, the comment "make parser" is not necessary. If you updated
  104. the .ll or .yy file, the
  105. regular "make" command should pick those changes up. However, since one source
  106. file generates multiple output files and you are likely to be using a multi-process
  107. build (by specifying the "-j" switch on the "make" command), there may be odd side effects:
  108. explicitly rebuilding the files manually by using "make parser" avoids any trouble.
  109. One problem brought on by use of flex/bison is tool version dependency. If one developer
  110. uses version A of those tools and another developer uses B, the files
  111. generated by the different version may be significantly different. This causes
  112. all sorts of problems, e.g. coverity/cpp-check issues may appear and disappear:
  113. in short, it can cause all sorts of general unhappiness.
  114. To avoid those problems, the Kea team generates the flex/bison
  115. files on a dedicated machine.
  116. @section parserFlex Flex Detailed
  117. Earlier sections described the lexer in a bit of an over-simplified way. The .ll file
  118. contains a number of elements in addition to the regular expressions
  119. and they're not as simple as was described.
  120. The file starts with a number of sections separated by percent (%) signs. Depending
  121. on which section code is written in, it may be interpreted by flex, copied
  122. verbatim to the output .cc file, copied to the output .h file or copied to both.
  123. There is an initial section that defines flex options. These are somewhat
  124. documented, but the documentation for it may be a bit cryptic. When developing new
  125. parsers, it's best to start by copying whatever we have for DHCPv6 and tweak as
  126. needed.
  127. Next comes the flex conditions. They are defined with %%x and they define a
  128. state of the lexer. A good example of a state may be comment. Once the lexer
  129. detects that a comment's beginning, it switches to a certain condition (by calling
  130. BEGIN(COMMENT) for example) and the code then ignores whatever follows
  131. (especially strings that look like valid tokens) until the comment is closed
  132. (when it returns to the default condition by calling BEGIN(INITIAL)). This is
  133. something that is not frequently used and the only use cases for it are the
  134. forementioned comments and file inclusions.
  135. After this come the syntactic contexts. Let's assume we have a parser that uses an
  136. "ip-address" regular expression (regexp) that would return the IP_ADDRESS token. Whenever we want to
  137. allow "ip-address", the grammar allows the IP_ADDRESS token to appear. When the
  138. lexer is called, it will match the regexp, generate the IP_ADDRESS token and
  139. the parser will carry out its duty. This works fine as long as you have very
  140. specific grammar that defines everything. Sadly, that's not the case in DHCP as
  141. we have hooks. Hook libraries can have parameters that are defined by third
  142. party developers and they can pick whatever parameter names they want, including
  143. "ip-address". Another example could be Dhcp4 and Dhcp6 configurations defined in a
  144. single file. The grammar defining "Dhcp6" main contain a clause that says
  145. "Dhcp4" may contain any generic JSON. However, the lexer may find the
  146. "ip-address" string in the "Dhcp4" configuration and will say that it's not a part of generic JSON, but a
  147. dedicated IP_ADDRESS token instead. The parser will then complain and the whole thing
  148. would end up in failure. It was to solve this problem that syntactic contexts were introduced.
  149. They tell the lexer whether input strings have specific or generic meaning.
  150. For example, when parsing host reservations,
  151. the lexer is expected to report the IP_ADDRESS token if "ip-address" is detected. However, when parsing generic
  152. JSON, upon encountering "ip-address" it should return a STRING with a value of "ip-address". The list of all
  153. contexts is enumerated in @ref isc::dhcp::Parser6Context::ParserContext.
  154. For a DHCPv6-specific description of the conflict avoidance, see @ref dhcp6ParserConflicts.
  155. @section parserGrammar Bison Grammar
  156. Bison has much better documentation than flex. Its latest version seems to be
  157. available here: https://www.gnu.org/software/bison/manual. Bison is a LALR(1)
  158. parser, which essentially means that it is able to parse (separate and analyze)
  159. any text that is described by set of rules. You can see the more formal
  160. description here: https://en.wikipedia.org/wiki/LALR_parser, but the plain
  161. English explanation is that you define a set of rules and bison will walk
  162. through input text trying to match the content to those rules. While doing
  163. so, it will be allowed to peek at most one symbol (token) ahead.
  164. As an example, let's take a closer look at the bison grammar we have for DHCPv6. It is defined
  165. in src/bin/dhcp6/dhcp6_parser.yy. Here's a simplified excerpt:
  166. @code
  167. // This defines a global Dhcp6 object.
  168. dhcp6_object: DHCP6 COLON LCURLY_BRACKET global_params RCURLY_BRACKET;
  169. // This defines all parameters that may appear in the Dhcp6 object.
  170. // It can either contain a global_param (defined below) or a
  171. // global_params list, followed by a comma followed by a global_param.
  172. // Note this definition is recursive and can expand to a single
  173. // instance of global_param or multiple instances separated by commas.
  174. // This is how bison handles variable number of parameters.
  175. global_params: global_param
  176. | global_params COMMA global_param
  177. ;
  178. // These are the parameters that are allowed in the top-level for
  179. // Dhcp6.
  180. global_param: preferred_lifetime
  181. | valid_lifetime
  182. | renew_timer
  183. | rebind_timer
  184. | subnet6_list
  185. | interfaces_config
  186. | lease_database
  187. | hosts_database
  188. | mac_sources
  189. | relay_supplied_options
  190. | host_reservation_identifiers
  191. | client_classes
  192. | option_data_list
  193. | hooks_libraries
  194. | expired_leases_processing
  195. | server_id
  196. | dhcp4o6_port
  197. ;
  198. renew_timer: RENEW_TIMER COLON INTEGER;
  199. // Many other definitions follow.
  200. @endcode
  201. The code above defines parameters that may appear in the Dhcp6 object
  202. declaration. One important trick to understand is understand the way to handle
  203. variable number of parameters. In bison it is most convenient to present them as
  204. recursive lists: in this example, global_params defined in a way that allows any number of
  205. global_param instances allowing the grammar to be easily extensible. If one
  206. needs to add a new global parameter, just add it to the
  207. global_param list.
  208. This type of definition has several levels, each representing logical
  209. structure of the configuration data. We start with global scope, then step
  210. into a Dhcp6 object that has a Subnet6 list, which in turn has Subnet6 instances,
  211. each of which has pools list and so on. Each level is represented as a separate
  212. rule.
  213. The "leaf" rules (that don't contain any other rules) must be defined by a
  214. series of tokens. An example of such a rule is renew_timer, above. It is defined
  215. as a series of 3 tokens: RENEW_TIMER, COLON and INTEGER.
  216. Speaking of integers, it is worth noting that some tokens can have values. Those
  217. values are defined using %token clause. For example, dhcp6_parser.yy contains the
  218. following:
  219. @code
  220. %token <std::string> STRING "constant string"
  221. %token <int64_t> INTEGER "integer"
  222. %token <double> FLOAT "floating point"
  223. %token <bool> BOOLEAN "boolean"
  224. @endcode
  225. The first line says that the token STRING has a type of std::string and when
  226. referring to this token in error messages, it should be printed as "constant
  227. string".
  228. In principle, it is valid to define just the grammar without any corresponding
  229. C++ code to it. Bison will go through the whole input text, match the
  230. rules and will either say the input adhered to the rules (parsing successful)
  231. or not (parsing failed). This may be a useful step when developing new parser,
  232. but it has no practical value. To perform specific actions, bison allows
  233. the injection of C++ code at almost any poing. For example we could augment the
  234. parsing of renew_timer with some extra code:
  235. @code
  236. renew_timer: RENEW_TIMER {
  237. cout << "renew-timer token detected, so far so good" << endl;
  238. } COLON {
  239. cout << "colon detected!" << endl;
  240. } INTEGER {
  241. uint32_t timer = $3;
  242. cout << "Got the renew-timer value: " << timer << endl;
  243. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3)));
  244. ctx.stack_.back()->set("renew-timer", prf);
  245. };
  246. @endcode
  247. This example showcases several important things. First, the ability to insert
  248. code at almost any step is very useful. It's also a powerful debugging tool.
  249. Second, some tokens are valueless (e.g. "renew-timer" when represented as the
  250. RENEW_TIMER token has no value), but some have values. In particular, the INTEGER
  251. token has value which can be extracted by $ followed by a number that
  252. represents its order, so $3 means "a value of third token or action
  253. in this rule".
  254. Also, some rules may have values. This is not used often, but there are specific
  255. cases when it's convenient. Let's take a look at the following excerpt from
  256. dhcp6_parser.yy:
  257. @code
  258. ncr_protocol: NCR_PROTOCOL {
  259. ctx.enter(ctx.NCR_PROTOCOL); (1)
  260. } COLON ncr_protocol_value {
  261. ctx.stack_.back()->set("ncr-protocol", $4); (3)
  262. ctx.leave(); (4)
  263. };
  264. ncr_protocol_value:
  265. UDP { $$ = ElementPtr(new StringElement("UDP", ctx.loc2pos(@1))); }
  266. | TCP { $$ = ElementPtr(new StringElement("TCP", ctx.loc2pos(@1))); } (2)
  267. ;
  268. @endcode
  269. (The numbers in brackets at the end of some lines do not appear in the code;
  270. they are used identify the statements in the following discussion.)
  271. The "ncr-protocol" parameter accepts one of two values: either tcp or
  272. udp. To handle such a case, we first enter the NCR_PROTOCOL context to tell the
  273. lexer that we're in this scope. The lexer will then know that any incoming string of
  274. text that is either "UDP" or "TCP" should be represented as one of the TCP or UDP
  275. tokens. The parser knows that after NCR_PROTOCOL there will be a colon followed
  276. by an ncr_protocol_value. The rule for ncr_protocol_value says it can be either the
  277. TCP token or the UDP token. Let's assume the input text is:
  278. @code
  279. "ncr-protocol": "TCP"
  280. @endcode
  281. Here's how the parser will handle it. First, it will attempt to match the rule
  282. for ncr_protocol. It will discover the first token is NCR_PROTOCOL. As a result,
  283. it will run the code (1), which will tell lexer to parse incoming tokens
  284. as ncr protocol values. The next token is expected to be COLON and the one
  285. after that the ncr_protocol_value. The lexer has already been switched into the NCR_PROTOCOL
  286. context, so it will recognize "TCP" as TCP token, not as a string with a value of "TCP".
  287. The parser will receive that token and match the line (2), which creates an appropriate
  288. representation that will be used as the rule's value ($$). Finally, the parser
  289. will unroll back to ncr_protocol rule and execute the code in lines (3) and (4).
  290. Line (3) picks the value set up in line (2) and adds it to the stack of
  291. values. Finally, line (4) tells the lexer that we finished the NCR protocol
  292. parsing and it can go back to whatever state it was before.
  293. @section parserBisonStack Generating the Element Tree in Bison
  294. The bison parser keeps matching rules until it reaches the end of input file. During
  295. that process, the code needs to build a hierarchy (a tree) of inter-connected
  296. Element objects that represents the parsed text. @ref isc::data::Element has a
  297. complex structure that defines parent-child relation differently depending on
  298. the type of parent (ae.g. a map and a list refer to their children in different ways). This
  299. requires the code to be aware of the parent content. In general, every time a
  300. new scope (an opening curly bracket in input text) is encountered, the code
  301. pushes new Element to the stack (see @ref isc::dhcp::Parser6Context::stack_)
  302. and every time the scope closes (a closing curly bracket in input text) the
  303. element is removed from the stack. With this approach, we always have access
  304. to the parent element as it's the last element on the stack. For example, when
  305. parsing preferred-lifetime, the code does the following:
  306. @code
  307. preferred_lifetime: PREFERRED_LIFETIME COLON INTEGER {
  308. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3)));
  309. ctx.stack_.back()->set("preferred-lifetime", prf);
  310. }
  311. @endcode
  312. The first line creates an instance of IntElement with a value of the token. The
  313. second line adds it to the current map (current = the last on the stack). This
  314. approach has a very nice property of being generic. This rule can be referenced
  315. from both global and subnet scope (and possibly other scopes as well) and the code
  316. will add the IntElement object to whatever is last on the stack, be it global,
  317. subnet or perhaps even something else (maybe one day we will allow preferred
  318. lifetime to be defined on a per pool or per host basis?).
  319. @section parserSubgrammar Parsing a Partial Configuration
  320. All the explanations so far assumed that we're operating in a default case of
  321. receiving the configuration as a whole. That is the case during startup and
  322. reconfiguration. However, both DHCPv4 and DHCPv6 support certain cases when the
  323. input text is not the whole configuration, but rather certain parts of it. There
  324. are several examples of such cases. The most common are unit-tests. They
  325. typically don't have the outermost { } or Dhcp6 object, but simply define
  326. whatever parameters are being tested. Second, we have the command channel that will,
  327. in the near future, contain parts of the configuration, depending on the
  328. command. For example, "add-reservation" will contain a host reservation only.
  329. Bison by default does not support multiple start rules, but there's a trick
  330. that can provide such a capability. The trick assumes that the starting
  331. rule may allow one of the artificial tokens that represent the scope
  332. expected. For example, when called from the "add-reservation" command, the
  333. artificial token will be SUB_RESERVATION and it will trigger the parser
  334. to bypass the global braces { and } and the "Dhcp6" token and jump immediately to the sub_reservation.
  335. This trick is also implemented in the lexer. A flag called start_token_flag,
  336. when initially set to true, will cause the lexer to emit an artificial
  337. token once, before parsing any input whatsoever.
  338. This optional feature can be skipped altogether if you don't plan to parse parts
  339. of the configuration.
  340. @section parserBisonExtend Extending the Grammar
  341. Adding new parameters to existing parsers is very easy once you get hold of the
  342. concept of what the grammar rules represent. The first step is to understand
  343. where the parameter is to be allowed. Typically a new parameter is allowed
  344. in one scope and only over time is it added to other scopes. Recently support
  345. for a 4o6-interface-id parameter has been added. That is a parameter that can
  346. be defined in a subnet and takes a string argument. You can see the actual
  347. change conducted in this commit:
  348. (https://github.com/isc-projects/kea/commit/9fccdbf54c4611dc10111ad8ff96d36cad59e1d6).
  349. Here's the complete set of changes that were necessary.
  350. 1. Define a new token in dhcp6_parser.yy:
  351. @code
  352. SUBNET_4O6_INTERFACE_ID "4o6-interface-id"
  353. @endcode
  354. This defines a token called SUBNET_4O6_INTERFACE_ID that, when it needs to
  355. be printed, e.g. in an error message, will be represented as "4o6-interface-id".
  356. 2. Tell the lexer how to recognize the new parameter:
  357. @code
  358. \"4o6-interface-id\" {
  359. switch(driver.ctx_) {
  360. case isc::dhcp::Parser4Context::SUBNET4:
  361. return isc::dhcp::Dhcp4Parser::make_SUBNET_4O6_INTERFACE_ID(driver.loc_);
  362. default:
  363. return isc::dhcp::Dhcp4Parser::make_STRING("4o6-interface-id", driver.loc_);
  364. }
  365. }
  366. @endcode
  367. It tells the parser that when in Subnet4 context, an incoming "4o6-interface-id" string
  368. should be represented as the SUBNET_4O6_INTERFACE_ID token. In any other context,
  369. it should be represented as a string.
  370. 3. Add the rule that will define the value. A user is expected to add something like
  371. @code
  372. "4o6-interface-id": "whatever"
  373. @endcode
  374. The rule to match this and similar statements looks as follows:
  375. @code
  376. subnet_4o6_interface_id: SUBNET_4O6_INTERFACE_ID {
  377. ctx.enter(ctx.NO_KEYWORD);
  378. } COLON STRING {
  379. ElementPtr iface(new StringElement($4, ctx.loc2pos(@4)));
  380. ctx.stack_.back()->set("4o6-interface-id", iface);
  381. ctx.leave();
  382. };
  383. @endcode
  384. Here's a good example of the context use. We have no idea what sort of interface-id
  385. the user will use. Typically that will be an integer, but it may be something
  386. weird that happens to match our reserved keywords. Therefore we switch to
  387. no keyword context. This tells the lexer to interpret everything as string,
  388. integer or float.
  389. 4. Finally, extend the existing subnet4_param that defines all allowed parameters
  390. in the Subnet4 scope to also cover our new parameter (the new line marked with *):
  391. @code
  392. subnet4_param: valid_lifetime
  393. | renew_timer
  394. | rebind_timer
  395. | option_data_list
  396. | pools_list
  397. | subnet
  398. | interface
  399. | interface_id
  400. | id
  401. | rapid_commit
  402. | client_class
  403. | reservations
  404. | reservation_mode
  405. | relay
  406. | match_client_id
  407. | next_server
  408. | subnet_4o6_interface
  409. | subnet_4o6_interface_id (*)
  410. | subnet_4o6_subnet
  411. | unknown_map_entry
  412. ;
  413. @endcode
  414. 5. Regenerate the flex/bison files by typing "make parser".
  415. 6. Run the unit-tests that you wrote before you touched any of the bison stuff. You did write them
  416. in advance, right?
  417. */