bison.dox 23 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483
  1. // Copyright (C) 2017 Internet Systems Consortium, Inc. ("ISC")
  2. //
  3. // This Source Code Form is subject to the terms of the Mozilla Public
  4. // License, v. 2.0. If a copy of the MPL was not distributed with this
  5. // file, You can obtain one at http://mozilla.org/MPL/2.0/.
  6. /**
  7. @page parser Flex/Bison parsers
  8. @section parserIntro Parser background
  9. Kea's data format of choice is JSON (https://tools.ietf.org/html/rfc7159), which
  10. is used in configuration files, in the command channel and also when
  11. communicating between DHCP servers and DHCP-DDNS component. It is almost certain
  12. it will be used as the data format for any new features.
  13. Historically, Kea used @ref isc::data::Element::fromJSON and @ref
  14. isc::data::Element::fromJSONFile methods to parse received data that is expected
  15. to be in JSON syntax. This in-house parser was developed back in the early BIND10
  16. days. Its two main advantages were that it didn't have any external dependencies
  17. and that it was already available in the source tree when the Kea project
  18. started. On the other hand, it was very difficult to modify (several attempts to
  19. implement more robust comments had failed) and not well implemented. Also, it
  20. was pure JSON parser, so it accepted anything as long as the content was correct
  21. JSON. This has led to other problems - the syntactic checks were conducted much
  22. later, when some of the information (e.g. line numbers) was no longer
  23. available. To print meaningful error messages for example, we had to develop a
  24. way to store filename,line and column information. This on the other hand, led
  25. to duplication. Anyway, this part of the processing is something we can refer to
  26. as phase 1: get input string, parse it and generate a tree of @ref
  27. isc::data::Element objects.
  28. That Element tree was then processed by set of dedicated parsers. Each parser
  29. was able to handle its own context, e.g. global, subnet list, subnet, pool
  30. etc. This step took the tree generated in the earlier step, parsed it and
  31. generated output configuration (e.g. @ref isc::dhcp::SrvConfig) or dynamic
  32. structures (e.g. isc::data::Host). There were a large number of parser objects
  33. derived from @ref isc::dhcp::DhcpConfigParser) instantiated for each scope and
  34. instance of data (e.g. to parse 1000 host reservation entries a thousand of
  35. dedicated parsers were created). For convenience, this step is called phase 2.
  36. Other issues with the old parsers are discussed here: @ref dhcpv6ConfigParserBison
  37. (this section is focused on DHCPv6, but the same issues affected DHCPv4 and D2)
  38. and here: http://kea.isc.org/wiki/SimpleParser.
  39. @section parserBisonIntro Flex/Bison based parser
  40. To solve the issue of phase 1 mentioned earlier, a new parser has been developed
  41. that is based on flex and bison tools. The following text uses DHCPv6 as an
  42. example, but the same principle applies to DHCPv4 and D2 and CA will likely to
  43. follow. The new parser consists of two core elements with a wrapper around them
  44. (the following description is slightly oversimplified to convey the intent, more
  45. detailed description is available in the following sections):
  46. -# Flex lexer (src/bin/dhcp6/dhcp6_lexer.ll) that is essentially a set of
  47. regular expressions with C++ code that creates new tokens that represent whatever
  48. was just parsed. This lexer will be called iteratively by bison until the whole
  49. input text is parsed or an error is encountered. For example, a snippet of the
  50. code could look like this:
  51. @code
  52. \"socket-type\" {
  53. return isc::dhcp::Dhcp6Parser::make_SOCKET_TYPE(driver.loc_);
  54. }
  55. @endcode
  56. This tells the flex that if encounters "socket-type" (quoted), then it should
  57. create a token SOCKET_TYPE and pass to it its current location (that's the
  58. file name, line and column number).
  59. -# Bison grammar (src/bin/dhcp6/dhcp6_parser.yy) that defines the syntax.
  60. Grammar and syntax are perhaps fancy words, but they simply define what is
  61. allowed and where. Bison grammar starts with a list of tokens. Those tokens
  62. are defined only by name ("here's the list of possible tokens that could
  63. appear"). What constitutes a token is actually defined in the lexer. The
  64. grammar define how the incoming tokens are expected to fall into their
  65. places together. Let's take an example of the following input text:
  66. @code
  67. {
  68. "Dhcp6":
  69. {
  70. "renew-timer": 100
  71. }
  72. }
  73. @endcode
  74. this code would return the following sentence of tokens: LCURLY_BRACKET,
  75. DHCP6, COLON, LCURLY_BRACKET, RENEW_TIMER, COLON, INTEGER
  76. (a token with a value of 100), RCURLY_BRACKET, RCURLY_BRACKET, END
  77. -# Parser context. As there is some information that needs to be passed between
  78. parser and lexer, @ref isc::dhcp::Parser6Context is a convenience wrapper
  79. around those two bundled together. It also works as a nice encapsulation,
  80. hiding all the flex/bison details underneath.
  81. @section parserBuild Building flex/bison code
  82. The only input file used by flex is the .ll file. The only input file used by
  83. bison is the .yy file. When making changes to the lexer or parser, only those
  84. two files are edited. When processed, those two tools will generate a number of
  85. .hh and .cc files. The major ones are named the same as their .ll and .yy
  86. counterparts (e.g. dhcp6_lexer.cc, dhcp6_parser.cc and dhcp6_parser.h), but
  87. there's a number of additional files created: location.hh, position.hh and
  88. stack.hh. Those are internal bison headers that are needed for compilation.
  89. To avoid every user to have flex and bison installed, we chose to generate the
  90. files and add them to the Kea repository. To generate those files, do the
  91. following:
  92. @code
  93. ./configure --enable-generate-parser
  94. cd src/bin/dhcp6
  95. make parser
  96. @endcode
  97. Strictly speaking, make parser is not necessary. If you updated .ll or .yy file,
  98. regular make command should pick those changes up. However, since one source
  99. file generates multiple output files and you are likely using multi-process
  100. build (make -j), there may be odd side effects, so I found it more convenient
  101. to explicitly rebuild the files manually by using "make parser".
  102. One problem flex/bison brings is the tool version dependency. If one developer
  103. uses version A of those tools and another developer uses B, then the files
  104. generated may be different and cause unnecessarily large diffs, may cause
  105. coverity/cpp-check issues appear and disappear and cause general unhappiness.
  106. To avoid those problems, we will introduce a requirement to generate flex/bison
  107. files on one dedicated machine. This machine will likely be docs. Currently Ops
  108. is working on installing the necessary versions of flex/bison required, but
  109. for the time being we can use the versions installed in Francis' home directory
  110. (export PATH=/home/fdupont/bin:$PATH).
  111. Note: the above applies only to the code being merged on master. It is probably
  112. ok to generate the files on your development branch with whatever version you
  113. have as long as it is not too old. In particular, the bison version needs to be
  114. at least 3.0.0 and Mac OS has 2.x version installed by default. When reviewing
  115. tickets that have flex/bison changes, please review .ll and .yy files and ignore
  116. the files generated from them. If you really insist, you're welcome to review
  117. them, but in most cases that will be an exercise in futility.
  118. @section parserFlex Flex detailed
  119. Earlier sections described the lexer in a bit over-simplified way. The .ll file
  120. contains a number of additional elements in addition to the regular expressions
  121. and they're not as simple as described.
  122. First, there's a number of sections separated by percent (%) signs. Depending
  123. on which section the code is written in, it may be interpreted by flex, copied
  124. verbatim to output .cc file, copied to output .h file or copied to both.
  125. There is an initial section that defines flex options. These are somewhat
  126. documented, but the docs for it may be a bit cryptic. When developing new
  127. parsers, it's best to start by copying whatever we have for DHCPv6 and tweak as
  128. needed.
  129. Second addition are flex conditions. They're defined with %%x and they define a
  130. state of the lexer. A good example of a state may be comment. Once the lexer
  131. detects that a comment's beginning, it switches to a certain condition (by calling
  132. BEGIN(COMMENT) for example) and the code then ignores whatever follows
  133. (especially strings that look like valid tokens) until the comment is closed
  134. (when it returns to the default condition by calling BEGIN(INITIAL)). This is
  135. something that is not frequently used and the only use cases for it are the
  136. forementioned comments and file inclusions.
  137. Second addition are parser contexts. Let's assume we have a parser that uses
  138. "ip-address" regexp that would return IP_ADDRESS token. Whenever we want to
  139. allow "ip-address", the grammar allows IP_ADDRESS token to appear. When the
  140. lexer is called, it will match the regexp, will generate the IP_ADDRESS token and
  141. the parser will carry out its duty. This works fine as long as you have very
  142. specific grammar that defines everything. Sadly, that's not the case in DHCP as
  143. we have hooks. Hook libraries can have parameters that are defined by third
  144. party developers and they can pick whatever parameter names they want, including
  145. "ip-address". Another example may be Dhcp4 and Dhcp6 configurations defined in a
  146. single file. When parsed by Dhcp6 server, its grammar has a clause that says
  147. "Dhcp4" may contain any generic JSON. However, the lexer will likely find the
  148. "ip-address" string and will say that it's not a part of generic JSON, but a
  149. dedicated IP_ADDRESS token. The parser would then complain and the whole thing
  150. would end up in failure. To solve this problem parser contexts were introduced.
  151. They tell the lexer whether input strings have specific or generic meaning.
  152. For example, when detecting "ip-address" string when parsing host reservation,
  153. the lexer is expected to report IP_ADDRESS token. However, when parsing generic
  154. JSON, it should return STRING with a value of "ip-address". The list of all
  155. contexts is enumerated in @ref isc::dhcp::Parser6Context::ParserContext.
  156. For DHCPv6-specific description of the conflict avoidance, see @ref dhcp6ParserConflicts.
  157. @section parserGrammar Bison grammar
  158. Bison has much better documentation than flex. Its latest version seems to be
  159. available here: https://www.gnu.org/software/bison/manual/ Bison is a LALR(1)
  160. parser, which essentially means that it is able to parse (separate and analyze)
  161. any text that is described by set of rules. You can see the more formal
  162. description here: https://en.wikipedia.org/wiki/LALR_parser, but the plain
  163. English explanation is that you define a set of rules and bison will walk
  164. through input text trying to match the content to those rules. While doing
  165. so, it will be allowed to peek at most one symbol (token) ahead.
  166. Let's take a closer look at the bison grammar we have for DHCPv6. It is defined
  167. in src/bin/dhcp6/dhcp6_parser.yy. Here's a simplified excerpt of it:
  168. @code
  169. // This defines a global Dhcp6 object.
  170. dhcp6_object: DHCP6 COLON LCURLY_BRACKET global_params RCURLY_BRACKET;
  171. // This defines all parameters that may appear in the Dhcp6 object.
  172. // It can either contain a global_param (defined below) or a
  173. // global_params list, followed by a comma followed by a global_param.
  174. // Note this definition is recursive and can expand to a single
  175. // instance of global_param or multiple instances separated by commas.
  176. // This is how bison handles variable number of parameters.
  177. global_params: global_param
  178. | global_params COMMA global_param
  179. ;
  180. // These are the parameters that are allowed in the top-level for
  181. // Dhcp6.
  182. global_param: preferred_lifetime
  183. | valid_lifetime
  184. | renew_timer
  185. | rebind_timer
  186. | subnet6_list
  187. | interfaces_config
  188. | lease_database
  189. | hosts_database
  190. | mac_sources
  191. | relay_supplied_options
  192. | host_reservation_identifiers
  193. | client_classes
  194. | option_data_list
  195. | hooks_libraries
  196. | expired_leases_processing
  197. | server_id
  198. | dhcp4o6_port
  199. ;
  200. renew_timer: RENEW_TIMER COLON INTEGER;
  201. // Many other definitions follow.
  202. @endcode
  203. The code above defines parameters that may appear in the Dhcp6 object
  204. declaration. One important trick to understand is to get the way to handle
  205. variable number of parameters. In bison it is most convenient to present them as
  206. recursive lists (global_params in this example) and allow any number of
  207. global_param instances. This way the grammar is very easily extensible. If one
  208. needs to add a new global parameter, he or she just needs to add it to the
  209. global_param list.
  210. This type of definitions has several levels, each representing logical
  211. structure of the configuration data. We start with global scope, then step
  212. into Dhcp6 object that has Subnet6 list, which has Subnet6 instances,
  213. which has pools list and so on. Each of those is represented as a separate
  214. rule.
  215. The "leaf" rules that don't contain any other rules, must be defined by a
  216. series of tokens. An example of such a rule is renew_timer above. It is defined
  217. as a series of 3 tokens: RENEW_TIMER, COLON and INTEGER.
  218. Speaking of integers, it is worth noting that some tokens can have values. Those
  219. values are defined using %token clause. For example, dhcp6_parser.yy has the
  220. following:
  221. @code
  222. %token <std::string> STRING "constant string"
  223. %token <int64_t> INTEGER "integer"
  224. %token <double> FLOAT "floating point"
  225. %token <bool> BOOLEAN "boolean"
  226. @endcode
  227. The first line says that the token STRING has a type of std::string and when
  228. referring to this token in error messages, it should be printed as "constant
  229. string".
  230. In principle, it is valid to define just the grammar without any corresponding
  231. C++ code to it. Bison will go through the whole input text, match the
  232. rules and will either say the input adhered to the rules (parsing successful)
  233. or not (parsing failed). This may be a useful step when developing new parser,
  234. but it has no practical value. To perform specific actions, bison allows
  235. injecting C++ code at almost any moment. For example we could augment the
  236. renew_timer with some extra code:
  237. @code
  238. renew_timer: RENEW_TIMER {
  239. cout << "renew-timer token detected, so far so good" << endl;
  240. } COLON {
  241. cout << "colon detected!" << endl;
  242. } INTEGER {
  243. uint32_t timer = $3;
  244. cout << "Got the renew-timer value: " << time << endl;
  245. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3)));
  246. ctx.stack_.back()->set("renew-timer", prf);
  247. };
  248. @endcode
  249. This example showcases several important things. First, the ability to insert
  250. code at almost any step is very useful. It's also a powerful debugging tool.
  251. Second, some tokens are valueless (e.g. "renew-timer" when represented as
  252. RENEW_TIMER token has no value), but some have values. In particular, INTEGER
  253. token has value which can be extracted by $ followed by a number that
  254. represents its order, so $3 means "a value of third token in this rule".
  255. Also, some rules may have values. This is not used often, but there are specific
  256. cases when it's convenient. Let's take a look at the following excerpt:
  257. @code
  258. ncr_protocol: NCR_PROTOCOL {
  259. ctx.enter(ctx.NCR_PROTOCOL); (1)
  260. } COLON ncr_protocol_value {
  261. ctx.stack_.back()->set("ncr-protocol", $4); (3)
  262. ctx.leave(); (4)
  263. };
  264. ncr_protocol_value:
  265. UDP { $$ = ElementPtr(new StringElement("UDP", ctx.loc2pos(@1))); }
  266. | TCP { $$ = ElementPtr(new StringElement("TCP", ctx.loc2pos(@1))); } (2)
  267. ;
  268. @endcode
  269. There's a "ncr-protocol" parameter that accepts one of two values: either tcp or
  270. udp. To handle such a case, we first enter the NCR_PROTOCOL context to tell the
  271. lexer that we're in this scope. Lexer will then know that any incoming string of
  272. text that is either "UDP" or "TCP" should be represented as one of TCP or UDP
  273. tokens. Parser knows that after NCR_PROTOCOL there will be a colon followed
  274. by ncr_protocol_value. The rule for ncr_protocol_value says it can be either
  275. TCP token or UDP token. Let's assume the input text has the following:
  276. @code
  277. "ncr-protocol": "TCP"
  278. @endcode
  279. Here's how the parser will handle it. First, it will attempt to match the rule
  280. for ncr_protocol. It will discover the first token is NCR_PROTOCOL. As a result,
  281. it will run the code (1), which will tell lexer to parse incoming tokens
  282. as ncr protocol values. The next token will be COLON. The next one expected
  283. after that is ncr_protocol_value. Lexer is already switched into NCR_PROTOCOL
  284. context, so it will recognize "TCP" as TCP token, not as a string of value of "TCP".
  285. Parser will receive that token and match the line (2). It will create appropriate
  286. representation that will be used a the rule's value ($$). Finally, parser
  287. will unroll back to ncr_protocol rule and execute the code in line (3) and (4).
  288. Line (3) will pick the value set up in line 2 and add it to the stack of
  289. values. Finally, line (4) will tell the lexer that we finished the NCR protocol
  290. parsing and it can go back to whatever state it was before.
  291. @section parserBisonStack Generating Element tree in Bison
  292. Bison parser keeps matching rules until it reaches the end of input file. During
  293. that process the code needs to build a hierarchy (a tree) of inter-connected
  294. Element objects that represents parsed text. @ref isc::data::Element has a
  295. complex structure that defines parent-child relation differently depending on
  296. the type of parent (maps refer to its children differently than lists). This
  297. requires the code to be aware of the parent content. In general, every time a
  298. new scope (an opening curly bracket in input text) is encountered, the code
  299. pushes new Element to the stack (see @ref isc::dhcp::Parser6Context::stack_)
  300. and every time the scope closes (a closing curly bracket in input text) the
  301. element is removed from the stack. With this approach, we always have access
  302. to the parent element as it's the last element on the stack. For example, when
  303. parsing preferred-lifetime, the code does the following:
  304. @code
  305. preferred_lifetime: PREFERRED_LIFETIME COLON INTEGER {
  306. ElementPtr prf(new IntElement($3, ctx.loc2pos(@3))); (1)
  307. ctx.stack_.back()->set("preferred-lifetime", prf); (2)
  308. }
  309. @endcode
  310. The first line creates an instance of IntElement with a value of the token. The
  311. second line adds it to the current map (current = the last on the stack). This
  312. approach has a very nice property of being generic. This rule can be referenced
  313. from global and subnet scope (and possibly other scopes as well) and the code
  314. will add the IntElement object to whatever is last on the stack, be it global,
  315. subnet or perhaps even something else (maybe one day we will allow preferred
  316. lifetime to be defined on a per pool or per host basis?).
  317. @section parserSubgrammar Parsing Partial Configuration
  318. All the explanations so far assumed that we're operating in a default case of
  319. receiving the configuration as a whole. That is the case during startup and
  320. reconfiguration. However, both DHCPv4 and DHCPv6 support certain cases when the
  321. input text is not the whole configuration, but rather certain parts of it. There
  322. are several examples of such cases. The most common are unit-tests. They
  323. typically don't have the outermost { } or Dhcp6 object, but simply define
  324. whatever parameters are being tested. Second, we have command channel that will
  325. in the near future contain parts of the configuration, depending on the
  326. command. For example, add-reservation will contain host reservation only.
  327. Bison by default does not support multiple start rules, but there's a trick
  328. that can provide such capability. The trick assumes that the starting
  329. rule may allow one of artificial tokens that represent the scope that is
  330. expected. For example, when called from add-reservation command, the
  331. artificial token will be SUB_RESERVATION and it will trigger the parser
  332. to bypass the global { }, Dhcp6 and jump immediately to sub_reservation.
  333. This trick is also implemented in the lexer. There's a flag called start_token_flag.
  334. When initially set to true, it will cause the lexer to emit an artificial
  335. token once, before parsing any input whatsoever.
  336. This optional feature can be skipped altogether if you don't plan to parse parts
  337. of the configuration.
  338. @section parserBisonExtend Extending grammar
  339. Adding new parameters to existing parsers is very easy once you get hold of the
  340. concept of what the grammar rules represent. The first step is to understand
  341. where the parameter is to be allowed. Typically a new parameter is allowed
  342. in one scope and only over time it is added in other scopes. Recently a support
  343. for 4o6-interface-id parameter has been added. That's parameter that can
  344. be defined in a subnet and takes a string argument. You can see the actual
  345. change conducted in this commit:
  346. (https://github.com/isc-projects/kea/commit/9fccdbf54c4611dc10111ad8ff96d36cad59e1d6).
  347. Here's the complete set of necessary changes.
  348. 1. Define a new token in dhcp6_parser.yy:
  349. @code
  350. SUBNET_4O6_INTERFACE_ID "4o6-interface-id"
  351. @endcode
  352. This defines a token called SUBNET_4O6_INTERFACE_ID that, when needed to
  353. be printed, will be represented as "4o6-interface-id".
  354. 2. Tell lexer how to recognize the new parameter:
  355. @code
  356. \"4o6-interface-id\" {
  357. switch(driver.ctx_) {
  358. case isc::dhcp::Parser4Context::SUBNET4:
  359. return isc::dhcp::Dhcp4Parser::make_SUBNET_4O6_INTERFACE_ID(driver.loc_);
  360. default:
  361. return isc::dhcp::Dhcp4Parser::make_STRING("4o6-interface-id", driver.loc_);
  362. }
  363. }
  364. @endcode
  365. It tells the parser that when in Subnet4 context, incoming "4o6-interface-id" string
  366. should be represented as SUBNET_4O6_INTERFACE_ID token. In any other context,
  367. it should be represented as a string.
  368. 3. Add the rule that will define the value. A user is expected to add something like
  369. @code
  370. "4o6-interface-id": "whatevah"
  371. @endcode
  372. The rule to match this and similar statements looks as follows:
  373. @code
  374. subnet_4o6_interface_id: SUBNET_4O6_INTERFACE_ID {
  375. ctx.enter(ctx.NO_KEYWORD);
  376. } COLON STRING {
  377. ElementPtr iface(new StringElement($4, ctx.loc2pos(@4)));
  378. ctx.stack_.back()->set("4o6-interface-id", iface);
  379. ctx.leave();
  380. };
  381. @endcode
  382. Here's a good example of the context use. We have no idea what sort of interface-id
  383. the user will use. Typically that will be an integer, but it may be something
  384. weird that happens to match our reserved keywords. Therefore we switch to
  385. no keyword context. This tells the lexer to interpret everything as string,
  386. integer or float.
  387. 4. Finally, extend the existing subnet4_param that defines all allowed parameters
  388. in Subnet4 scope to also cover our new parameter (the new line marked with *):
  389. @code
  390. subnet4_param: valid_lifetime
  391. | renew_timer
  392. | rebind_timer
  393. | option_data_list
  394. | pools_list
  395. | subnet
  396. | interface
  397. | interface_id
  398. | id
  399. | rapid_commit
  400. | client_class
  401. | reservations
  402. | reservation_mode
  403. | relay
  404. | match_client_id
  405. | next_server
  406. | subnet_4o6_interface
  407. | subnet_4o6_interface_id (*)
  408. | subnet_4o6_subnet
  409. | unknown_map_entry
  410. ;
  411. @endcode
  412. 5. Regenerate flex/bison files by typing make parser.
  413. 6. Run unit-tests that you wrote before touch any bison stuff. You did write them
  414. in advance, right?
  415. */