// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet they do not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe (e.g. array indexing is
bounds-checked) and also guards against integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.
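
As a minimal sketch of that sandboxing step (not this program's actual code),
the helper below forks a child process that enters SECCOMP_MODE_STRICT and then
exits. The name run_in_strict_sandbox and the exit-code convention are
assumptions made for illustration. It assumes a POSIX system and only attempts
the sandbox on Linux.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__)
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#endif

// run_in_strict_sandbox is a hypothetical helper, not part of jsonptr. It
// returns the child's exit code: 0 if the child entered the sandbox, 2 if the
// sandbox was unavailable. It returns -1 if fork or waitpid failed, or if the
// child was killed by a signal.
int run_in_strict_sandbox() {
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
#if defined(__linux__)
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == 0) {
      // From here on, only read, write, exit and sigreturn are permitted. The
      // C library's exit functions may issue exit_group instead, which the
      // sandbox forbids, so make the raw exit system call directly.
      syscall(SYS_exit, 0);
    }
#endif
    _exit(2);  // Not sandboxed: a normal exit is fine.
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller could treat a result of 2 the way the -fail-if-unsandboxed flag treats
a missing sandbox: as a reason to refuse to process untrusted input.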

Altogether, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `g_depth` and `g_ctx` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.
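
To illustrate why recursion is not needed, here is a minimal sketch that
re-indents JSON in a single pass, keeping only a depth counter and two flags.
It is not this program's code and does not use Wuffs: it walks bytes rather
than tokens, it assumes that the input is already valid JSON, and the name
indent_compact_json is hypothetical.

```cpp
#include <cstddef>
#include <string>

// indent_compact_json is a hypothetical helper, not part of jsonptr. It
// assumes that src is valid JSON and does no validation of its own.
std::string indent_compact_json(const std::string& src, unsigned spaces) {
  std::string dst;
  unsigned depth = 0;
  bool in_string = false;
  bool after_open = false;  // The previous structural byte was '[' or '{'.

  auto new_line = [&]() {
    dst += '\n';
    dst.append(depth * spaces, ' ');
  };

  for (std::size_t i = 0; i < src.size(); i++) {
    char c = src[i];
    if (in_string) {
      dst += c;
      if ((c == '\\') && (i + 1 < src.size())) {
        dst += src[++i];  // Copy the escaped byte verbatim.
      } else if (c == '"') {
        in_string = false;
      }
    } else if ((c == ' ') || (c == '\t') || (c == '\n') || (c == '\r')) {
      continue;  // Skip insignificant whitespace.
    } else if ((c == ']') || (c == '}')) {
      if (depth > 0) {
        depth--;
      }
      if (!after_open) {
        new_line();  // Empty containers stay on one line, as "[]" or "{}".
      }
      dst += c;
      after_open = false;
    } else if (c == ',') {
      dst += c;
      new_line();
    } else if (c == ':') {
      dst += ": ";
    } else {
      // An opening bracket, or a byte of a string, number or literal.
      if (after_open) {
        new_line();
      }
      dst += c;
      in_string = (c == '"');
      after_open = (c == '[') || (c == '{');
      if (after_open) {
        depth++;
      }
    }
  }
  return dst;
}
```

For example, indent_compact_json("[1,{\"a\":2}]", 2) puts each element on its
own line, indented by two spaces per nesting level. The depth counter plays the
role that `g_depth` plays in this program.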

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.

----

To run:

$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#if defined(__cplusplus) && (__cplusplus < 201103L)
#error "This C++ program requires -std=c++11 or later"
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but doing so lets
// users of release/c/etc.c whitelist which parts of Wuffs to build. That file
// contains the entire Wuffs standard library, implementing a variety of codecs
// and file formats. Without this macro definition, an optimizing compiler or
// linker may very well discard Wuffs code for unused codecs, but listing the
// Wuffs modules we use makes that process explicit. Preprocessing means that
// such code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -d=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s=NUM  -spaces=NUM\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "            -input-allow-comments\n"
    "            -input-allow-extra-comma\n"
    "            -input-allow-inf-nan-numbers\n"
    "            -jwcc\n"
    "            -output-comments\n"
    "            -output-extra-comma\n"
    "            -strict-json-pointer-syntax\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
    "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "The -input-allow-comments flag allows \"/*slash-star*/\" and\n"
    "\"//slash-slash\" C-style comments within JSON input. Such comments are\n"
    "stripped from the output unless -output-comments was also set.\n"
    "\n"
    "The -input-allow-extra-comma flag allows input like \"[1,2,]\", with a\n"
    "comma after the final element of a JSON list or dictionary.\n"
    "\n"
    "The -input-allow-inf-nan-numbers flag allows non-finite floating point\n"
    "numbers (infinities and not-a-numbers) within JSON input.\n"
    "\n"
    "The -output-comments flag copies any input comments to the output. It\n"
    "has no effect unless -input-allow-comments was also set. Comments look\n"
    "better after commas than before them, but a closing \"]\" or \"}\" can\n"
    "occur after arbitrarily many comments, so -output-comments also requires\n"
    "that one or both of -compact-output and -output-extra-comma be set.\n"
    "\n"
    "The -output-extra-comma flag writes output like \"[1,2,]\", with a comma\n"
    "after the final element of a JSON list or dictionary. Such commas are\n"
    "non-compliant with the JSON specification but many parsers accept them\n"
    "and they can produce simpler line-based diffs. This flag is ignored when\n"
    "-compact-output is set.\n"
    "\n"
    "The -jwcc flag (JSON With Commas and Comments) enables all of:\n"
    "    -input-allow-comments\n"
    "    -input-allow-extra-comma\n"
    "    -output-comments\n"
    "    -output-extra-comma\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
    "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
    "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
    "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
    "which can work better with line oriented Unix tools that assume exactly\n"
    "one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is equivalent to\n"
    "-d=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";
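
/*
The usage text above distinguishes the "" query (the root) from the "/" query
(the root's child whose key is the empty string). As a minimal sketch of that
distinction, the helper below splits a JSON Pointer into its unescaped
fragments. It is not how this program works (the Query class further below
walks the query string in place, without allocating), and the name
split_json_pointer is hypothetical. It assumes that the pointer is already
valid per RFC 6901, and it handles only the "~0" and "~1" escapes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// split_json_pointer is a hypothetical helper, not part of jsonptr.
std::vector<std::string> split_json_pointer(const std::string& pointer) {
  std::vector<std::string> fragments;
  for (std::size_t i = 0; i < pointer.size(); i++) {
    char c = pointer[i];
    if (c == '/') {
      fragments.push_back(std::string());  // Each '/' starts a new fragment.
      continue;
    }
    if (fragments.empty()) {
      continue;  // Invalid pointer (no leading '/'): ignore these bytes.
    }
    if ((c == '~') && (i + 1 < pointer.size())) {
      char e = pointer[i + 1];
      if (e == '0') {
        c = '~';
        i++;
      } else if (e == '1') {
        c = '/';
        i++;
      }
    }
    fragments.back() += c;
  }
  return fragments;
}
```

Under this sketch, "" yields zero fragments (the root), "/" yields one empty
fragment, and "/xyz" and "/xyz/" yield one and two fragments respectively.
*/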

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

#define NEW_LINE_THEN_256_SPACES \
  "\n                                                                " \
  "                                                                " \
  "                                                                " \
  "                                                                "
#define NEW_LINE_THEN_256_TABS \
  "\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t" \
  "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"

const char* g_new_line_then_256_indent_bytes;
uint32_t g_bytes_per_indent_depth;

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_cursor_index is the g_src.data.ptr index between the previous and current
// token. An invariant is that (g_cursor_index <= g_src.meta.ri).
size_t g_cursor_index;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

bool g_is_after_comment;

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_json__decoder g_dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//  / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                ^           ^
//                m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                      mfj
//                      v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//  / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                ^           ^
//                mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }
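
/*
The countdown that tick implements can be sketched without Wuffs. The two names
below, parse_array_index and Countdown, are hypothetical stand-ins for
wuffs_base__parse_number_u64 and the m_array_index field. The sketch assumes
RFC 6901's array index syntax: "0", or decimal digits with no leading zero.

```cpp
#include <cstdint>
#include <string>

// parse_array_index is a hypothetical helper, not part of jsonptr. It returns
// whether s is an array index and, if so, stores its value in *out.
bool parse_array_index(const std::string& s, uint64_t* out) {
  if (s.empty() || ((s.size() > 1) && (s[0] == '0'))) {
    return false;  // Reject "" and leading zeroes such as "00" and "07".
  }
  uint64_t n = 0;
  for (char c : s) {
    if ((c < '0') || ('9' < c)) {
      return false;
    }
    uint64_t digit = (uint64_t)(c - '0');
    if (n > ((UINT64_MAX - digit) / 10)) {
      return false;  // The value would overflow a uint64_t.
    }
    n = (10 * n) + digit;
  }
  *out = n;
  return true;
}

// Countdown is a hypothetical stand-in for the m_array_index field. Its tick
// method is called once per array element and returns true when the upcoming
// element is the one that the fragment names.
struct Countdown {
  bool valid;
  uint64_t remaining;

  bool tick() {
    if (valid) {
      if (remaining == 0) {
        return true;
      }
      remaining--;
    }
    return false;
  }
};
```

With a fragment of "2", the first two calls to tick return false and the third
returns true, selecting the array's third element.
*/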

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8(i, k - i),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (query_c_string, length) arguments form a
  // valid JSON Pointer. In particular, it must be valid UTF-8, and either be
  // empty or start with a '/'. Any '~' within must immediately be followed by
  // either '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also
  // be followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s.ptr, s.len);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} g_query;
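
/*
The incremental matching that the Query class comment describes can be sketched
in isolation. FragmentMatcher below is a hypothetical, simplified stand-in for
Query, not this program's code: it holds a single, already-isolated fragment in
a std::string (so it allocates, unlike Query), uses an index where Query uses
the mfj pointer, and uses a boolean where Query sets mfj to nullptr.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// FragmentMatcher is a hypothetical helper, not part of jsonptr.
class FragmentMatcher {
 public:
  explicit FragmentMatcher(std::string fragment)
      : m_frag(std::move(fragment)), m_j(0), m_alive(true) {}

  // restart is called at the start of each object key.
  void restart() {
    m_j = 0;
    m_alive = true;
  }

  // feed consumes the next piece of the object key, already unescaped from
  // its JSON form. The key matches so far only if the piece is a prefix of
  // what remains of the fragment, after unescaping "~0", "~1", "~n" and "~r".
  void feed(const std::string& piece) {
    for (std::size_t p = 0; m_alive && (p < piece.size()); p++) {
      if (m_j >= m_frag.size()) {
        m_alive = false;  // The key is longer than the fragment.
        return;
      }
      char want = m_frag[m_j++];
      if (want == '~') {
        char e = (m_j < m_frag.size()) ? m_frag[m_j++] : '\x00';
        if (e == '0') {
          want = '~';
        } else if (e == '1') {
          want = '/';
        } else if (e == 'n') {
          want = '\n';
        } else if (e == 'r') {
          want = '\r';
        } else {
          m_alive = false;  // Not a valid escape sequence.
          return;
        }
      }
      m_alive = (want == piece[p]);
    }
  }

  // matched is called at the end of each object key.
  bool matched() const { return m_alive && (m_j == m_frag.size()); }

 private:
  std::string m_frag;
  std::size_t m_j;
  bool m_alive;
};
```

Feeding "ban", "a" and then "na" to a matcher for "banana" is a match, just as
feeding the single piece "banana" would be. Feeding "ban" and then "x" is not,
and neither is feeding "banana" followed by "s".
*/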

// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  bool input_allow_comments;
  bool input_allow_extra_comma;
  bool input_allow_inf_nan_numbers;
  bool output_comments;
  bool output_extra_comma;
  bool strict_json_pointer_syntax;
  bool tabs;

  uint32_t max_output_depth;
  uint32_t spaces;

  char* query_c_string;
} g_flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  g_flags.spaces = 4;
  g_flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      g_flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
      g_flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "d=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
      if (u.status.is_ok() && (u.value <= 0xFFFFFFFF)) {
        g_flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return g_usage;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      g_flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-comments")) {
      g_flags.input_allow_comments = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-extra-comma")) {
      g_flags.input_allow_extra_comma = true;
      continue;
    }
    if (!strcmp(arg, "input-allow-inf-nan-numbers")) {
      g_flags.input_allow_inf_nan_numbers = true;
      continue;
    }
    if (!strcmp(arg, "jwcc")) {
      g_flags.input_allow_comments = true;
      g_flags.input_allow_extra_comma = true;
      g_flags.output_comments = true;
      g_flags.output_extra_comma = true;
      continue;
    }
    if (!strcmp(arg, "output-comments")) {
      g_flags.output_comments = true;
      continue;
    }
    if (!strcmp(arg, "output-extra-comma")) {
      g_flags.output_extra_comma = true;
      continue;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      g_flags.query_c_string = arg;
      continue;
    }
    if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        g_flags.spaces = arg[0] - '0';
        continue;
      }
      return g_usage;
    }
    if (!strcmp(arg, "strict-json-pointer-syntax")) {
      g_flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      g_flags.tabs = true;
      continue;
    }

    return g_usage;
  }

  if (g_flags.query_c_string &&
      !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
                       g_flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  g_flags.remaining_argc = argc - c;
  g_flags.remaining_argv = argv + c;
  return nullptr;
}

const char*  //
initialize_globals(int argc, char** argv) {
  g_dst = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_src = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  g_tok = wuffs_base__make_token_buffer(
      wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_token_buffer_meta());

  g_cursor_index = 0;

  g_depth = 0;

  g_ctx = context::none;

  g_is_after_comment = false;

  TRY(parse_flags(argc, argv));
  if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
    return "main: unsandboxed";
  }
  if (g_flags.output_comments && !g_flags.compact_output &&
      !g_flags.output_extra_comma) {
    return "main: -output-comments requires one or both of -compact-output and "
           "-output-extra-comma";
  }
  const int stdin_fd = 0;
  if (g_flags.remaining_argc >
      ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
    return g_usage;
  }

  g_new_line_then_256_indent_bytes =
      g_flags.tabs ? NEW_LINE_THEN_256_TABS : NEW_LINE_THEN_256_SPACES;
  g_bytes_per_indent_depth = g_flags.tabs ? 1 : g_flags.spaces;

  g_query.reset(g_flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
  g_wrote_to_dst = false;

  TRY(g_dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
          .message());

  if (g_flags.input_allow_comments) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_BLOCK, true);
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_LINE, true);
  }
  if (g_flags.input_allow_extra_comma) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
  }
  if (g_flags.input_allow_inf_nan_numbers) {
    g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_INF_NAN_NUMBERS, true);
  }

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);

  return nullptr;
}

// ----

// ignore_return_value suppresses errors from -Wall -Werror.
static void  //
ignore_return_value(int ignored) {}

const char*  //
read_src() {
  if (g_src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  g_src.compact();
  if (g_src.meta.wi >= g_src.data.len) {
    return "main: g_src buffer is full";
  }
  while (true) {
    ssize_t n = read(g_input_file_descriptor, g_src.writer_pointer(),
                     g_src.writer_length());
    if (n >= 0) {
      g_src.meta.wi += n;
      g_src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = g_dst.reader_length();
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, g_dst.reader_pointer(), n);
    if (i >= 0) {
      g_dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  g_dst.compact();
  return nullptr;
}

const char*  //
write_dst_slow(const void* s, size_t n) {
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = g_dst.writer_length();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = g_dst.writer_length();
      if (i == 0) {
        return "main: g_dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
    g_dst.meta.wi += i;
    p += i;
    n -= i;
    g_wrote_to_dst = true;
  }
  return nullptr;
}

inline const char*  //
write_dst(const void* s, size_t n) {
  if (g_suppress_write_dst > 0) {
    return nullptr;
  } else if (n <= (DST_BUFFER_ARRAY_SIZE - g_dst.meta.wi)) {
    memcpy(g_dst.data.ptr + g_dst.meta.wi, s, n);
    g_dst.meta.wi += n;
    g_wrote_to_dst = true;
    return nullptr;
  }
  return write_dst_slow(s, n);
}

#define TRY_INDENT_WITH_LEADING_NEW_LINE \
  do { \
    uint32_t indent = g_depth * g_bytes_per_indent_depth; \
    TRY(write_dst(g_new_line_then_256_indent_bytes, 1 + (indent & 0xFF))); \
    for (indent >>= 8; indent > 0; indent--) { \
      TRY(write_dst(g_new_line_then_256_indent_bytes + 1, 0x100)); \
    } \
  } while (false)

// TRY_INDENT_SANS_LEADING_NEW_LINE is used after comments, which print their
// own "\n".
#define TRY_INDENT_SANS_LEADING_NEW_LINE \
  do { \
    uint32_t indent = g_depth * g_bytes_per_indent_depth; \
    TRY(write_dst(g_new_line_then_256_indent_bytes + 1, (indent & 0xFF))); \
    for (indent >>= 8; indent > 0; indent--) { \
      TRY(write_dst(g_new_line_then_256_indent_bytes + 1, 0x100)); \
    } \
  } while (false)

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
    }

    // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
    // JSON string. They need to remain escaped.
    uint8_t esc6[6];
    esc6[0] = '\\';
    esc6[1] = 'u';
    esc6[2] = '0';
    esc6[3] = '0';
    esc6[4] = hex_digit(ucp >> 4);
    esc6[5] = hex_digit(ucp >> 0);
    return write_dst(&esc6[0], 6);

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);
  }

  uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
  size_t n = wuffs_base__utf_8__encode(
      wuffs_base__make_slice_u8(&u[0],
                                WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
      ucp);
  if (n == 0) {
    return "main: internal error: unexpected Unicode code point";
  }
  return write_dst(&u[0], n);
}

// ----

const char*  //
handle_token(wuffs_base__token t, bool start_of_token_chain) {
  do {
    int64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t token_length = t.length();
    // The "- token_length" is because we incremented g_cursor_index before
    // calling handle_token.
    wuffs_base__slice_u8 tok = wuffs_base__make_slice_u8(
        g_src.data.ptr + g_cursor_index - token_length, token_length);

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (g_query.is_at(g_depth)) {
        return "main: no match for query";
      }
      if (g_depth <= 0) {
        return "main: internal error: inconsistent g_depth";
      }
      g_depth--;

      if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
        g_suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace) &&
            !g_flags.compact_output) {
          if (g_is_after_comment) {
            TRY_INDENT_SANS_LEADING_NEW_LINE;
          } else {
            if (g_flags.output_extra_comma) {
              TRY(write_dst(",", 1));
            }
            TRY_INDENT_WITH_LEADING_NEW_LINE;
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_value
                  : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (start_of_token_chain) {
      if (g_is_after_comment) {
        TRY_INDENT_SANS_LEADING_NEW_LINE;
      } else if (g_ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
      } else if (g_ctx != context::none) {
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!g_flags.compact_output) {
          TRY_INDENT_WITH_LEADING_NEW_LINE;
        }
      }

      bool query_matched_fragment = false;
      if (g_query.is_at(g_depth)) {
        switch (g_ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = g_query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = g_query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!g_query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
        // we were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return g_eod
        // after the upcoming JSON value is complete.
        if (g_suppress_write_dst != 1) {
          return "main: internal error: inconsistent g_suppress_write_dst";
        }
        g_suppress_write_dst = 0;
        g_ctx = context::none;
        g_depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (start_of_token_chain) {
          TRY(write_dst("\"", 1));
          g_query.restart_fragment(in_dict_before_key() &&
                                   g_query.is_at(g_depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          TRY(write_dst(tok.ptr, tok.len));
          g_query.incremental_match_slice(tok.ptr, tok.len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.continued()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.continued()) {
          return "main: internal error: unexpected non-continued UCP token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(tok.ptr, tok.len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  bool start_of_token_chain = true;
  while (true) {
    wuffs_base__status status = g_dec.decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t token_length = t.length();
      if ((g_src.meta.ri - g_cursor_index) < token_length) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_cursor_index += token_length;

      // Handle filler tokens (e.g. whitespace, punctuation and comments).
      // These are skipped, unless -output-comments is enabled.
      if (t.value_base_category() == WUFFS_BASE__TOKEN__VBC__FILLER) {
        if (g_flags.output_comments &&
            (t.value_base_detail() &
             WUFFS_BASE__TOKEN__VBD__FILLER__COMMENT_ANY)) {
          if (g_flags.compact_output) {
            TRY(write_dst(g_src.data.ptr + g_cursor_index - token_length,
                          token_length));
          } else {
            if (start_of_token_chain) {
              if (g_is_after_comment) {
                TRY_INDENT_SANS_LEADING_NEW_LINE;
              } else if (g_ctx != context::none) {
                if (g_ctx == context::in_dict_after_key) {
                  TRY(write_dst(":", 1));
                } else if ((g_ctx != context::in_list_after_bracket) &&
                           (g_ctx != context::in_dict_after_brace)) {
                  TRY(write_dst(",", 1));
                }
                if (!g_flags.compact_output) {
                  TRY_INDENT_WITH_LEADING_NEW_LINE;
                }
              }
            }
            TRY(write_dst(g_src.data.ptr + g_cursor_index - token_length,
                          token_length));
            if (!t.continued() &&
                (t.value_base_detail() &
                 WUFFS_BASE__TOKEN__VBD__FILLER__COMMENT_BLOCK)) {
              TRY(write_dst("\n", 1));
            }
            g_is_after_comment = true;
          }
        }
        start_of_token_chain = !t.continued();
        continue;
      }

      const char* z = handle_token(t, start_of_token_chain);
      g_is_after_comment = false;
      start_of_token_chain = !t.continued();
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_cursor_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_cursor_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
Nigel Tao9cc2c252020-02-23 17:05:49 +11001298 // Return an exit code of 1 for regular (forseen) errors, e.g. badly
1299 // formatted or unsupported input.
1300 //
1301 // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
1302 // run-time checks found that an internal invariant did not hold.
1303 //
1304 // Automated testing, including badly formatted inputs, can therefore
1305 // discriminate between expected failure (exit code 1) and unexpected failure
1306 // (other non-zero exit codes). Specifically, exit code 2 for internal
1307 // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
1308 // linux) for a segmentation fault (e.g. null pointer dereference).
1309 return strstr(status_msg, "internal error:") ? 2 : 1;
1310}
1311
Nigel Tao2914bae2020-02-26 09:40:30 +11001312int //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless it comes after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}