// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads UTF-8 JSON from stdin and writes
canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet it does not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The core JSON implementation is also written in the Wuffs programming language
(and then transpiled to C/C++), which is memory-safe (e.g. array indexing is
bounds-checked) but also guards against integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

All together, this program aims to safely handle untrusted JSON files without
fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing the
decoder's token-stream output individually. The core loop, in pseudo-code, is
"for_each_token { handle_token(etc); }", where the handle_token function
changes global state (e.g. the `depth` and `context` variables) and prints
output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C.

    $CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)
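
// A minimal sketch of how a TRY-style macro propagates the first error
// message out of a function that returns const char* (nullptr on success).
// The names TRY_SKETCH, check_positive, check_small and check_both are
// hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Same shape as the TRY macro: evaluate once, return early on a non-null
// error message.
#define TRY_SKETCH(error_msg)    \
  do {                           \
    const char* z = (error_msg); \
    if (z) {                     \
      return z;                  \
    }                            \
  } while (false)

static const char* check_positive(int x) {
  return (x > 0) ? nullptr : "sketch: not positive";
}

static const char* check_small(int x) {
  return (x < 100) ? nullptr : "sketch: too large";
}

// check_both returns the first non-null error message, or nullptr if both
// checks pass.
static const char* check_both(int x) {
  TRY_SKETCH(check_positive(x));
  TRY_SKETCH(check_small(x));
  return nullptr;
}
```

// The do { ... } while (false) wrapper lets the macro be used as a single
// statement followed by a semicolon, including inside an unbraced if/else.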

static const char* eod = "main: end of data";

static const char* usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -i=NUM  -indent=NUM\n"
    "    -o=NUM  -max-output-depth=NUM\n"
    "    -q=STR  -query=STR\n"
    "    -s      -strict-json-pointer-syntax\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads UTF-8 JSON from stdin and\n"
    "writes canonicalized, formatted UTF-8 JSON to stdout.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -i=NUM /\n"
    "-indent=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON value, this program will return a zero\n"
    "exit code even if the rest of the input isn't valid JSON. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON specification (https://json.org/) permits implementations that\n"
    "allow duplicate keys, as this one does. This JSON Pointer implementation\n"
    "is also greedy, following the first match for each fragment without\n"
    "back-tracking. For example, the \"/foo/bar\" query will fail if the root\n"
    "object has multiple \"foo\" children but the first one doesn't have a\n"
    "\"bar\" child, even if later ones do.\n"
    "\n"
    "The -s or -strict-json-pointer-syntax flag restricts the -query=STR\n"
    "string to exactly RFC 6901, with only two escape sequences: \"~0\" and\n"
    "\"~1\" for \"~\" and \"/\". Without this flag, this program also lets\n"
    "\"~n\" and \"~r\" escape the New Line and Carriage Return ASCII control\n"
    "characters, which can work better with line oriented Unix tools that\n"
    "assume exactly one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -o=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON containers ([] arrays and {} objects) can hold other\n"
    "containers. When this flag is set, containers at depth NUM are replaced\n"
    "with \"[…]\" or \"{…}\". A bare -o or -max-output-depth is equivalent to\n"
    "-o=1. The flag's absence is equivalent to an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON. The JSON\n"
    "specification permits implementations to set their own maximum input\n"
    "depth. This JSON implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";
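
// The usage text above notes that "" is the root, "/" names the child whose
// key is the empty string, and "/xyz" differs from "/xyz/". A minimal sketch
// of why: splitting a JSON Pointer on '/' yields one fragment per slash. The
// split_pointer_sketch helper is hypothetical; it uses std::vector and
// std::string for brevity, whereas jsonptr itself never allocates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// split_pointer_sketch splits an already-validated JSON Pointer into its raw
// (still "~"-escaped) fragments. Each '/' starts a new fragment, which may be
// empty.
static std::vector<std::string> split_pointer_sketch(const std::string& q) {
  std::vector<std::string> frags;
  std::size_t i = 0;
  while ((i < q.size()) && (q[i] == '/')) {
    std::size_t k = q.find('/', i + 1);
    if (k == std::string::npos) {
      k = q.size();
    }
    frags.push_back(q.substr(i + 1, k - (i + 1)));
    i = k;
  }
  return frags;
}
```

// Under this sketch, "" has zero fragments (the root), "/" has one empty
// fragment, "/xyz" has one fragment and "/xyz/" has two, the second empty.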

// ----

bool sandboxed = false;

int input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer dst;
wuffs_base__io_buffer src;
wuffs_base__token_buffer tok;

// curr_token_end_src_index is the src.data.ptr index of the end of the current
// token. An invariant is that (curr_token_end_src_index <= src.meta.ri).
size_t curr_token_end_src_index;

uint32_t depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} ctx;

bool  //
in_dict_before_key() {
  return (ctx == context::in_dict_after_brace) ||
         (ctx == context::in_dict_after_value);
}
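
// A minimal sketch of how a context value like the one above can drive
// output without recursion: the previous token's context alone decides which
// separator precedes the next key or value. This assumes compact output (no
// newlines or indentation) and is an illustration, not the program's actual
// handle_token logic. The sketch_context, separator_before and after_scalar
// names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

enum class sketch_context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
};

// separator_before returns the compact-output text to emit before the next
// key or value, given the context left behind by the previous token.
static const char* separator_before(sketch_context c) {
  switch (c) {
    case sketch_context::in_list_after_value:
    case sketch_context::in_dict_after_value:
      return ",";
    case sketch_context::in_dict_after_key:
      return ":";
    default:
      return "";
  }
}

// after_scalar returns the context that follows a complete string or number.
// Inside a dict, a scalar seen "before key" is a key; otherwise it is a value.
static sketch_context after_scalar(sketch_context c) {
  switch (c) {
    case sketch_context::in_list_after_bracket:
    case sketch_context::in_list_after_value:
      return sketch_context::in_list_after_value;
    case sketch_context::in_dict_after_brace:
    case sketch_context::in_dict_after_value:
      return sketch_context::in_dict_after_key;
    case sketch_context::in_dict_after_key:
      return sketch_context::in_dict_after_value;
    default:
      return sketch_context::none;
  }
}
```

// Walking {"a":1,"b":2} through these two functions, starting from
// in_dict_after_brace, yields the separators "", ":", "," and ":" in turn.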

uint32_t suppress_write_dst;
bool wrote_to_dst;

wuffs_json__decoder dec;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                       mfj
//                       v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index =
          wuffs_base__parse_number_u64(wuffs_base__make_slice_u8(i, k - i));
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (ptr, len) arguments form a valid JSON
  // Pointer. In particular, it must be valid UTF-8, and either be empty or
  // start with a '/'. Any '~' within must immediately be followed by either
  // '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also be
  // followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} query;
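
// The Query class compares escaped query bytes against unescaped JSON bytes
// in place. A minimal sketch of the same "~0", "~1", "~n" and "~r" mapping,
// written as a standalone function that builds the unescaped fragment. The
// unescape_fragment_sketch name is hypothetical, and it uses std::string for
// brevity, whereas jsonptr itself never allocates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// unescape_fragment_sketch assumes the fragment already passed validation,
// so that every '~' is followed by one of '0', '1', 'n' or 'r'.
static std::string unescape_fragment_sketch(const std::string& frag) {
  std::string out;
  for (std::size_t i = 0; i < frag.size(); i++) {
    if ((frag[i] != '~') || ((i + 1) >= frag.size())) {
      out.push_back(frag[i]);
      continue;
    }
    i++;
    switch (frag[i]) {
      case '0':
        out.push_back('~');
        break;
      case '1':
        out.push_back('/');
        break;
      case 'n':
        out.push_back('\n');
        break;
      case 'r':
        out.push_back('\r');
        break;
      default:
        // Not expected for validated input; keep the bytes verbatim.
        out.push_back('~');
        out.push_back(frag[i]);
        break;
    }
  }
  return out;
}
```

// Each escape is consumed left to right in a single pass, so "~01" becomes
// "~1" (a tilde followed by a one), not "/".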

// ----

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  size_t indent;
  uint32_t max_output_depth;
  char* query_c_string;
  bool strict_json_pointer_syntax;
  bool tabs;
} flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  flags.indent = 4;
  flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "fail-if-unsandboxed")) {
      flags.fail_if_unsandboxed = true;
      continue;
    }
    if (!strncmp(arg, "i=", 2) || !strncmp(arg, "indent=", 7)) {
      while (*arg++ != '=') {
      }
      if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
        flags.indent = arg[0] - '0';
        continue;
      }
      return usage;
    }
    if (!strcmp(arg, "o") || !strcmp(arg, "max-output-depth")) {
      flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "o=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      // The prefix length is 17, so that the comparison includes the '='.
      // Otherwise, an argument without any '=' could reach the loop below.
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)));
      if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
        flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return usage;
    }
    if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
      while (*arg++ != '=') {
      }
      flags.query_c_string = arg;
      continue;
    }
    if (!strcmp(arg, "s") || !strcmp(arg, "strict-json-pointer-syntax")) {
      flags.strict_json_pointer_syntax = true;
      continue;
    }
    if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
      flags.tabs = true;
      continue;
    }

    return usage;
  }

  if (flags.query_c_string &&
      !Query::validate(flags.query_c_string, strlen(flags.query_c_string),
                       flags.strict_json_pointer_syntax)) {
    return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
  }

  flags.remaining_argc = argc - c;
  flags.remaining_argv = argv + c;
  return nullptr;
}
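
// parse_flags matches "name=" prefixes with strncmp and hand-counted
// lengths, then scans forward to the '='. A minimal sketch of an alternative
// that derives the length from the flag name and checks for the '=' itself,
// so that the count cannot drift from the string. The flag_value_sketch name
// is hypothetical and is not part of jsonptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// flag_value_sketch returns a pointer to the text after "name=" if arg starts
// with exactly that prefix, or nullptr otherwise. The arg and name pointers
// must be NUL-terminated C strings.
static const char* flag_value_sketch(const char* arg, const char* name) {
  std::size_t n = std::strlen(name);
  // If the first n bytes match, arg has at least n bytes before its NUL, so
  // reading arg[n] is in bounds.
  if ((std::strncmp(arg, name, n) != 0) || (arg[n] != '=')) {
    return nullptr;
  }
  return arg + n + 1;
}
```

// With this shape, an argument such as "max-output-depthx" (no '=') is
// rejected up front instead of reaching a scan-for-'=' loop.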

const char*  //
initialize_globals(int argc, char** argv) {
  dst = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(dst_array, DST_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  src = wuffs_base__make_io_buffer(
      wuffs_base__make_slice_u8(src_array, SRC_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_io_buffer_meta());

  tok = wuffs_base__make_token_buffer(
      wuffs_base__make_slice_token(tok_array, TOKEN_BUFFER_ARRAY_SIZE),
      wuffs_base__empty_token_buffer_meta());

  curr_token_end_src_index = 0;

  depth = 0;

  ctx = context::none;

  TRY(parse_flags(argc, argv));
  if (flags.fail_if_unsandboxed && !sandboxed) {
    return "main: unsandboxed";
  }
  const int stdin_fd = 0;
  if (flags.remaining_argc > ((input_file_descriptor != stdin_fd) ? 1 : 0)) {
    return usage;
  }

  query.reset(flags.query_c_string);

  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  suppress_write_dst = query.next_fragment() ? 1 : 0;
  wrote_to_dst = false;

  TRY(dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
          .message());

  // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line oriented Unix tools (such as "echo 123 |
  // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
  // can accidentally contain trailing whitespace.
  dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);

  return nullptr;
}

685// ----
686
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100687// ignore_return_value suppresses errors from -Wall -Werror.
688static void //
689ignore_return_value(int ignored) {}
690
const char*  //
read_src() {
  if (src.meta.closed) {
    return "main: internal error: read requested on a closed source";
  }
  src.compact();
  if (src.meta.wi >= src.data.len) {
    return "main: src buffer is full";
  }
  while (true) {
    ssize_t n = read(input_file_descriptor, src.data.ptr + src.meta.wi,
                     src.data.len - src.meta.wi);
    if (n >= 0) {
      src.meta.wi += n;
      src.meta.closed = n == 0;
      break;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  return nullptr;
}

const char*  //
flush_dst() {
  while (true) {
    size_t n = dst.meta.wi - dst.meta.ri;
    if (n == 0) {
      break;
    }
    const int stdout_fd = 1;
    ssize_t i = write(stdout_fd, dst.data.ptr + dst.meta.ri, n);
    if (i >= 0) {
      dst.meta.ri += i;
    } else if (errno != EINTR) {
      return strerror(errno);
    }
  }
  dst.compact();
  return nullptr;
}

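// The retry-on-EINTR loops in read_src and flush_dst can be sketched in
// isolation as below. This is a minimal, hypothetical helper (the name
// write_all_sketch is an assumption and is not part of jsonptr). It writes to
// an arbitrary file descriptor instead of the global dst buffer and returns an
// errno value instead of an error message string.

```cpp
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// write_all_sketch (hypothetical) keeps calling write(2) until all n bytes
// starting at s have been written. A short write advances the pointer and
// loops again. A write that fails with EINTR is retried. Any other failure
// returns the errno value. It returns 0 on success.
int  //
write_all_sketch(int fd, const void* s, size_t n) {
  assert((s != nullptr) || (n == 0));
  const char* p = static_cast<const char*>(s);
  while (n > 0) {
    ssize_t i = write(fd, p, n);
    if (i >= 0) {
      p += i;
      n -= static_cast<size_t>(i);
    } else if (errno != EINTR) {
      return errno;
    }
  }
  return 0;
}
```

// As in flush_dst, a partial write is not an error: the loop only stops once
// every byte is written or a non-EINTR error occurs.
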
const char*  //
write_dst(const void* s, size_t n) {
  if (suppress_write_dst > 0) {
    return nullptr;
  }
  const uint8_t* p = static_cast<const uint8_t*>(s);
  while (n > 0) {
    size_t i = dst.writer_available();
    if (i == 0) {
      const char* z = flush_dst();
      if (z) {
        return z;
      }
      i = dst.writer_available();
      if (i == 0) {
        return "main: dst buffer is full";
      }
    }

    if (i > n) {
      i = n;
    }
    memcpy(dst.data.ptr + dst.meta.wi, p, i);
    dst.meta.wi += i;
    p += i;
    n -= i;
    wrote_to_dst = true;
  }
  return nullptr;
}

// ----

uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}

const char*  //
handle_unicode_code_point(uint32_t ucp) {
  if (ucp < 0x0020) {
    switch (ucp) {
      case '\b':
        return write_dst("\\b", 2);
      case '\f':
        return write_dst("\\f", 2);
      case '\n':
        return write_dst("\\n", 2);
      case '\r':
        return write_dst("\\r", 2);
      case '\t':
        return write_dst("\\t", 2);
      default: {
        // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
        // JSON string. They need to remain escaped.
        uint8_t esc6[6];
        esc6[0] = '\\';
        esc6[1] = 'u';
        esc6[2] = '0';
        esc6[3] = '0';
        esc6[4] = hex_digit(ucp >> 4);
        esc6[5] = hex_digit(ucp >> 0);
        return write_dst(&esc6[0], 6);
      }
    }

  } else if (ucp == '\"') {
    return write_dst("\\\"", 2);

  } else if (ucp == '\\') {
    return write_dst("\\\\", 2);

  } else {
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        ucp);
    if (n > 0) {
      return write_dst(&u[0], n);
    }
  }

  return "main: internal error: unexpected Unicode code point";
}

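// The escaping rules that handle_unicode_code_point applies can be sketched
// as a pure function, which is easier to exercise in isolation. This is a
// hypothetical illustration only (escape_code_point_sketch is an assumed name
// and is not part of jsonptr). Unlike the real function, it returns a
// std::string (so it allocates) and it only accepts ASCII code points, so the
// UTF-8 encoding branch is left out.

```cpp
#include <assert.h>
#include <stdint.h>

#include <string>

// escape_code_point_sketch (hypothetical) returns the JSON string
// representation of a single ASCII code point: a two-character escape for the
// common control characters, '"' and '\\', a six-character \u00XX escape for
// other code points below 0x20, and the character itself otherwise.
std::string  //
escape_code_point_sketch(uint32_t ucp) {
  assert(ucp < 0x80);  // Assumption: this sketch handles ASCII only.
  static const char hex[] = "0123456789ABCDEF";
  switch (ucp) {
    case '\b':
      return "\\b";
    case '\f':
      return "\\f";
    case '\n':
      return "\\n";
    case '\r':
      return "\\r";
    case '\t':
      return "\\t";
    case '\"':
      return "\\\"";
    case '\\':
      return "\\\\";
  }
  if (ucp < 0x20) {
    std::string s = "\\u00";
    s += hex[(ucp >> 4) & 0x0F];
    s += hex[(ucp >> 0) & 0x0F];
    return s;
  }
  return std::string(1, static_cast<char>(ucp));
}
```

// For example, U+0001 stays escaped as the six bytes \u0001, while 'A' passes
// through unchanged.
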
const char*  //
handle_token(wuffs_base__token t) {
  do {
    uint64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t len = t.length();

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (query.is_at(depth)) {
        return "main: no match for query";
      }
      if (depth <= 0) {
        return "main: internal error: inconsistent depth";
      }
      depth--;

      if (query.matched_all() && (depth >= flags.max_output_depth)) {
        suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                          ? "\"[…]\""
                          : "\"{…}\"",
                      7));
      } else {
        // Write preceding whitespace.
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace) && !flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (uint32_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      }

      ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                ? context::in_list_after_value
                : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (!t.link_prev()) {
      if (ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", flags.compact_output ? 1 : 2));
      } else if (ctx != context::none) {
        if ((ctx != context::in_list_after_bracket) &&
            (ctx != context::in_dict_after_brace)) {
          TRY(write_dst(",", 1));
        }
        if (!flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < depth; i++) {
            TRY(write_dst(flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                          flags.tabs ? 1 : flags.indent));
          }
        }
      }

      bool query_matched_fragment = false;
      if (query.is_at(depth)) {
        switch (ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset the ctx and depth as if we
        // were about to decode a top-level value. This makes any subsequent
        // indentation be relative to this point, and we will return eod after
        // the upcoming JSON value is complete.
        if (suppress_write_dst != 1) {
          return "main: internal error: inconsistent suppress_write_dst";
        }
        suppress_write_dst = 0;
        ctx = context::none;
        depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (query.matched_all() && (depth >= flags.max_output_depth)) {
          suppress_write_dst++;
        } else {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        }
        depth++;
        ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_bracket
                  : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        if (!t.link_prev()) {
          TRY(write_dst("\"", 1));
          query.restart_fragment(in_dict_before_key() && query.is_at(depth));
        }

        if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
          // No-op.
        } else if (vbd &
                   WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
          uint8_t* ptr = src.data.ptr + curr_token_end_src_index - len;
          TRY(write_dst(ptr, len));
          query.incremental_match_slice(ptr, len);
        } else {
          return "main: internal error: unexpected string-token conversion";
        }

        if (t.link_next()) {
          return nullptr;
        }
        TRY(write_dst("\"", 1));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.link_prev() || !t.link_next()) {
          return "main: internal error: unexpected unlinked token";
        }
        TRY(handle_unicode_code_point(vbd));
        query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_dst(src.data.ptr + curr_token_end_src_index - len, len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (depth == 0) {
    return eod;
  }
  switch (ctx) {
    case context::in_list_after_bracket:
      ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  while (true) {
    wuffs_base__status status = dec.decode_tokens(&tok, &src);

    while (tok.meta.ri < tok.meta.wi) {
      wuffs_base__token t = tok.data.ptr[tok.meta.ri++];
      uint64_t n = t.length();
      if ((src.meta.ri - curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent src indexes";
      }
      curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value() == 0) {
        continue;
      }

      const char* z = handle_token(t);
      if (z == nullptr) {
        continue;
      } else if (z == eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (curr_token_end_src_index != src.meta.ri) {
        return "main: internal error: inconsistent src indexes";
      }
      TRY(read_src());
      curr_token_end_src_index = src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (flags.query_c_string && *flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((src.meta.ri == src.meta.wi) && !src.meta.closed) {
    TRY(read_src());
  }
  if ((src.meta.ri < src.meta.wi) || !src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; tok.meta.ri < tok.meta.wi; tok.meta.ri++) {
    if (tok.data.ptr[tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // Linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

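// The exit code convention described in compute_exit_code can be sketched
// without the stderr side effects. This is a hypothetical illustration
// (classify_status_sketch and its examples function are assumed names and are
// not part of jsonptr). It keeps only the classification step: nullptr means
// success, a message containing "internal error:" means an invariant
// violation and anything else is a regular error.

```cpp
#include <assert.h>
#include <string.h>

// classify_status_sketch (hypothetical) maps a status message to an exit
// code: 0 for success (nullptr), 2 for internal errors and 1 otherwise.
int  //
classify_status_sketch(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

// Usage examples, written as assertions. The messages are taken from the
// error strings used elsewhere in this file.
void  //
classify_status_sketch_examples() {
  assert(classify_status_sketch(nullptr) == 0);
  assert(classify_status_sketch("main: no match for query") == 1);
  assert(classify_status_sketch("main: internal error: unexpected token") ==
         2);
}
```

// A test harness could then treat exit code 1 as an expected rejection of bad
// input and any other non-zero exit code as a bug to investigate.
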
int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // A flag starts with "-", unless it comes after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      input_file_descriptor = open(arg, O_RDONLY);
      if (input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (wrote_to_dst) {
    const char* z1 = write_dst("\n", 1);
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}