blob: b4768754a0e88121cee6a69b8074bebd0b9f393d [file] [log] [blame]
Nigel Tao1b073492020-02-16 22:11:36 +11001// Copyright 2020 The Wuffs Authors.
2//
3// Licensed under the Apache License, Version 2.0 (the "License");
4// you may not use this file except in compliance with the License.
5// You may obtain a copy of the License at
6//
7// https://www.apache.org/licenses/LICENSE-2.0
8//
9// Unless required by applicable law or agreed to in writing, software
10// distributed under the License is distributed on an "AS IS" BASIS,
11// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12// See the License for the specific language governing permissions and
13// limitations under the License.
14
15// ----------------
16
17/*
Nigel Tao0cd2f982020-03-03 23:03:02 +110018jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
Nigel Tao168f60a2020-07-14 13:19:33 +100019(RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin and writes CBOR
20or canonicalized, formatted UTF-8 JSON to stdout.
Nigel Tao0cd2f982020-03-03 23:03:02 +110021
Nigel Taod60815c2020-03-26 14:32:35 +110022See the "const char* g_usage" string below for details.
Nigel Tao0cd2f982020-03-03 23:03:02 +110023
24----
25
26JSON Pointer (and this program's implementation) is one of many JSON query
27languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
28simple and fewer-featured compared to those others.
29
Nigel Tao168f60a2020-07-14 13:19:33 +100030One benefit of simplicity is that this program's CBOR, JSON and JSON Pointer
Nigel Tao0cd2f982020-03-03 23:03:02 +110031implementations do not dynamically allocate or free memory (yet it does not
32require that the entire input fits in memory at once). They are therefore
33trivially protected against certain bug classes: memory leaks, double-frees and
34use-after-frees.
35
Nigel Tao168f60a2020-07-14 13:19:33 +100036The CBOR and JSON implementations are also written in the Wuffs programming
37language (and then transpiled to C/C++), which is memory-safe (e.g. array
38indexing is bounds-checked) but also prevents integer arithmetic overflows.
Nigel Tao0cd2f982020-03-03 23:03:02 +110039
Nigel Taofe0cbbd2020-03-05 22:01:30 +110040For defense in depth, on Linux, this program also self-imposes a
41SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
42or writing its output. Under this sandbox, the only permitted system calls are
43read, write, exit and sigreturn.
44
Nigel Tao168f60a2020-07-14 13:19:33 +100045All together, this program aims to safely handle untrusted CBOR or JSON files
46without fear of security bugs such as remote code execution.
Nigel Tao0cd2f982020-03-03 23:03:02 +110047
48----
Nigel Tao1b073492020-02-16 22:11:36 +110049
Nigel Taoc5b3a9e2020-02-24 11:54:35 +110050As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
51JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
52"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
53was first published on 2016-10-26 and updated on 2018-03-30.
54
Nigel Tao0cd2f982020-03-03 23:03:02 +110055After modifying this program, run "build-example.sh example/jsonptr/" and then
56"script/run-json-test-suite.sh" to catch correctness regressions.
57
58----
59
Nigel Taod0b16cb2020-03-14 10:15:54 +110060This program uses Wuffs' JSON decoder at a relatively low level, processing the
61decoder's token-stream output individually. The core loop, in pseudo-code, is
62"for_each_token { handle_token(etc); }", where the handle_token function
Nigel Taod60815c2020-03-26 14:32:35 +110063changes global state (e.g. the `g_depth` and `g_ctx` variables) and prints
Nigel Taod0b16cb2020-03-14 10:15:54 +110064output text based on that state and the token's source text. Notably,
65handle_token is not recursive, even though JSON values can nest.
66
67This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
68string, object) comprises one or more JSON tokens.
69
70An alternative, higher-level approach is in the sibling example/jsonfindptrs
71program. Neither approach is better or worse per se, but when studying this
72program, be aware that there are multiple ways to use Wuffs' JSON decoder.
73
74The two programs, jsonfindptrs and jsonptr, also demonstrate different
75trade-offs with regard to JSON object duplicate keys. The JSON spec permits
76different implementations to allow or reject duplicate keys. It is not always
77clear which approach is safer. Rejecting them is certainly unambiguous, and
78security bugs can lurk in ambiguous corners of a file format, if two different
79implementations both silently accept a file but differ on how to interpret it.
80On the other hand, in the worst case, detecting duplicate keys requires O(N)
81memory, where N is the size of the (potentially untrusted) input.
82
83This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
84mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
85it runs in a SECCOMP_MODE_STRICT sandbox.
86
87----
88
Nigel Tao1b073492020-02-16 22:11:36 +110089This example program differs from most other example Wuffs programs in that it
90is written in C++, not C.
91
92$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out
93
94for a C++ compiler $CXX, such as clang++ or g++.
95*/
96
Nigel Tao721190a2020-04-03 22:25:21 +110097#if defined(__cplusplus) && (__cplusplus < 201103L)
98#error "This C++ program requires -std=c++11 or later"
99#endif
100
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100101#include <errno.h>
Nigel Tao01abc842020-03-06 21:42:33 +1100102#include <fcntl.h>
103#include <stdio.h>
Nigel Tao9cc2c252020-02-23 17:05:49 +1100104#include <string.h>
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100105#include <unistd.h>
Nigel Tao1b073492020-02-16 22:11:36 +1100106
107// Wuffs ships as a "single file C library" or "header file library" as per
108// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
109//
110// To use that single file as a "foo.c"-like implementation, instead of a
111// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
112// compiling it.
113#define WUFFS_IMPLEMENTATION
114
115// Defining the WUFFS_CONFIG__MODULE* macros are optional, but it lets users of
116// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
117// the entire Wuffs standard library, implementing a variety of codecs and file
118// formats. Without this macro definition, an optimizing compiler or linker may
119// very well discard Wuffs code for unused codecs, but listing the Wuffs
120// modules we use makes that process explicit. Preprocessing means that such
121// code simply isn't compiled.
122#define WUFFS_CONFIG__MODULES
123#define WUFFS_CONFIG__MODULE__BASE
124#define WUFFS_CONFIG__MODULE__JSON
125
126// If building this program in an environment that doesn't easily accommodate
127// relative includes, you can use the script/inline-c-relative-includes.go
128// program to generate a stand-alone C++ file.
129#include "../../release/c/wuffs-unsupported-snapshot.c"
130
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100131#if defined(__linux__)
132#include <linux/prctl.h>
133#include <linux/seccomp.h>
134#include <sys/prctl.h>
135#include <sys/syscall.h>
136#define WUFFS_EXAMPLE_USE_SECCOMP
137#endif
138
Nigel Tao2cf76db2020-02-27 22:42:01 +1100139#define TRY(error_msg) \
140 do { \
141 const char* z = error_msg; \
142 if (z) { \
143 return z; \
144 } \
145 } while (false)
146
Nigel Taod60815c2020-03-26 14:32:35 +1100147static const char* g_eod = "main: end of data";
Nigel Tao2cf76db2020-02-27 22:42:01 +1100148
Nigel Taod60815c2020-03-26 14:32:35 +1100149static const char* g_usage =
Nigel Tao01abc842020-03-06 21:42:33 +1100150 "Usage: jsonptr -flags input.json\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100151 "\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100152 "Flags:\n"
Nigel Tao3690e832020-03-12 16:52:26 +1100153 " -c -compact-output\n"
Nigel Tao94440cf2020-04-02 22:28:24 +1100154 " -d=NUM -max-output-depth=NUM\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000155 " -o=FMT -output-format={json,cbor}\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100156 " -q=STR -query=STR\n"
Nigel Taoecadf722020-07-13 08:22:34 +1000157 " -s=NUM -spaces=NUM\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100158 " -t -tabs\n"
159 " -fail-if-unsandboxed\n"
Nigel Taoc766bb72020-07-09 12:59:32 +1000160 " -input-json-extra-comma\n"
161 " -output-json-extra-comma\n"
Nigel Taoecadf722020-07-13 08:22:34 +1000162 " -strict-json-pointer-syntax\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100163 "\n"
Nigel Tao01abc842020-03-06 21:42:33 +1100164 "The input.json filename is optional. If absent, it reads from stdin.\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100165 "\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100166 "----\n"
167 "\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100168 "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000169 "Pointer (RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin\n"
170 "and writes CBOR or canonicalized, formatted UTF-8 JSON to stdout. The\n"
171 "input and output formats do not have to match, but conversion between\n"
172 "formats may be lossy.\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100173 "\n"
174 "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
175 "as \"abc\\n\\txÅ·z\". It does not sort object keys, nor does it reject\n"
Nigel Tao01abc842020-03-06 21:42:33 +1100176 "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100177 "\n"
178 "Formatted means that arrays' and objects' elements are indented, each\n"
Nigel Taoecadf722020-07-13 08:22:34 +1000179 "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000180 "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags. Those\n"
181 "flags only apply to JSON (not CBOR) output.\n"
182 "\n"
183 "The -input-format and -output-format flags select between reading and\n"
184 "writing JSON (the default, a textual format) or CBOR (a binary format).\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100185 "\n"
Nigel Taoc766bb72020-07-09 12:59:32 +1000186 "The -input-json-extra-comma flag allows input like \"[1,2,]\", with a\n"
187 "comma after the final element of a JSON list or dictionary.\n"
188 "\n"
189 "The -output-json-extra-comma flag writes extra commas, regardless of\n"
190 "whether the input had it. Extra commas are non-compliant with the JSON\n"
191 "specification but many parsers accept it and it can produce simpler\n"
192 "line-based diffs. This flag is ignored when -compact-output is set.\n"
193 "\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100194 "----\n"
195 "\n"
196 "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100197 "print a subset of the input. For example, given RFC 6901 section 5's\n"
Nigel Tao01abc842020-03-06 21:42:33 +1100198 "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
199 " jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100200 "will print:\n"
201 " \"baz\"\n"
202 "\n"
203 "An absent query is equivalent to the empty query, which identifies the\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100204 "entire input (the root value). Unlike a file system, the \"/\" query\n"
Nigel Taod0b16cb2020-03-14 10:15:54 +1100205 "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
206 "child (the value in a key-value pair) of the root whose key is the empty\n"
207 "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100208 "\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000209 "If the query found a valid JSON|CBOR value, this program will return a\n"
210 "zero exit code even if the rest of the input isn't valid. If the query\n"
Nigel Tao0cd2f982020-03-03 23:03:02 +1100211 "did not find a value, or found an invalid one, this program returns a\n"
212 "non-zero exit code, but may still print partial output to stdout.\n"
213 "\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000214 "The JSON and CBOR specifications (https://json.org/ or RFC 8259; RFC\n"
215 "7049) permit implementations to allow duplicate keys, as this one does.\n"
216 "This JSON Pointer implementation is also greedy, following the first\n"
217 "match for each fragment without back-tracking. For example, the\n"
218 "\"/foo/bar\" query will fail if the root object has multiple \"foo\"\n"
219 "children but the first one doesn't have a \"bar\" child, even if later\n"
220 "ones do.\n"
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100221 "\n"
Nigel Taoecadf722020-07-13 08:22:34 +1000222 "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
223 "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
224 "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
225 "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
226 "which can work better with line oriented Unix tools that assume exactly\n"
227 "one value (i.e. one JSON Pointer string) per line.\n"
Nigel Taod6fdfb12020-03-11 12:24:14 +1100228 "\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100229 "----\n"
230 "\n"
Nigel Tao94440cf2020-04-02 22:28:24 +1100231 "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000232 "output depth. JSON|CBOR containers ([] arrays and {} objects) can hold\n"
233 "other containers. When this flag is set, containers at depth NUM are\n"
234 "replaced with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is\n"
235 "equivalent to -d=1. The flag's absence means an unlimited output depth.\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100236 "\n"
237 "The -max-output-depth flag only affects the program's output. It doesn't\n"
Nigel Tao168f60a2020-07-14 13:19:33 +1000238 "affect whether or not the input is considered valid JSON|CBOR. The\n"
239 "format specifications permit implementations to set their own maximum\n"
240 "input depth. This JSON|CBOR implementation sets it to 1024.\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100241 "\n"
242 "Depth is measured in terms of nested containers. It is unaffected by the\n"
243 "number of spaces or tabs used to indent.\n"
244 "\n"
245 "When both -max-output-depth and -query are set, the output depth is\n"
246 "measured from when the query resolves, not from the input root. The\n"
247 "input depth (measured from the root) is still limited to 1024.\n"
248 "\n"
249 "----\n"
250 "\n"
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100251 "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
252 "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100253 "sandbox, regardless of whether this flag was set.";
Nigel Tao0cd2f982020-03-03 23:03:02 +1100254
Nigel Tao2cf76db2020-02-27 22:42:01 +1100255// ----
256
Nigel Taof3146c22020-03-26 08:47:42 +1100257// Wuffs allows either statically or dynamically allocated work buffers. This
258// program exercises static allocation.
259#define WORK_BUFFER_ARRAY_SIZE \
260 WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
261#if WORK_BUFFER_ARRAY_SIZE > 0
Nigel Taod60815c2020-03-26 14:32:35 +1100262uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
Nigel Taof3146c22020-03-26 08:47:42 +1100263#else
264// Not all C/C++ compilers support 0-length arrays.
Nigel Taod60815c2020-03-26 14:32:35 +1100265uint8_t g_work_buffer_array[1];
Nigel Taof3146c22020-03-26 08:47:42 +1100266#endif
267
Nigel Taod60815c2020-03-26 14:32:35 +1100268bool g_sandboxed = false;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100269
Nigel Taod60815c2020-03-26 14:32:35 +1100270int g_input_file_descriptor = 0; // A 0 default means stdin.
Nigel Tao01abc842020-03-06 21:42:33 +1100271
Nigel Tao2cf76db2020-02-27 22:42:01 +1100272#define MAX_INDENT 8
Nigel Tao107f0ef2020-03-01 21:35:02 +1100273#define INDENT_SPACES_STRING " "
Nigel Tao6e7d1412020-03-06 09:21:35 +1100274#define INDENT_TAB_STRING "\t"
Nigel Tao107f0ef2020-03-01 21:35:02 +1100275
Nigel Taofdac24a2020-03-06 21:53:08 +1100276#ifndef DST_BUFFER_ARRAY_SIZE
277#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
Nigel Tao1b073492020-02-16 22:11:36 +1100278#endif
Nigel Taofdac24a2020-03-06 21:53:08 +1100279#ifndef SRC_BUFFER_ARRAY_SIZE
280#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
Nigel Tao1b073492020-02-16 22:11:36 +1100281#endif
Nigel Taofdac24a2020-03-06 21:53:08 +1100282#ifndef TOKEN_BUFFER_ARRAY_SIZE
283#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
Nigel Tao1b073492020-02-16 22:11:36 +1100284#endif
285
Nigel Taod60815c2020-03-26 14:32:35 +1100286uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
287uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
288wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];
Nigel Tao1b073492020-02-16 22:11:36 +1100289
Nigel Taod60815c2020-03-26 14:32:35 +1100290wuffs_base__io_buffer g_dst;
291wuffs_base__io_buffer g_src;
292wuffs_base__token_buffer g_tok;
Nigel Tao1b073492020-02-16 22:11:36 +1100293
Nigel Taod60815c2020-03-26 14:32:35 +1100294// g_curr_token_end_src_index is the g_src.data.ptr index of the end of the
295// current token. An invariant is that (g_curr_token_end_src_index <=
296// g_src.meta.ri).
297size_t g_curr_token_end_src_index;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100298
Nigel Taod60815c2020-03-26 14:32:35 +1100299uint32_t g_depth;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100300
301enum class context {
302 none,
303 in_list_after_bracket,
304 in_list_after_value,
305 in_dict_after_brace,
306 in_dict_after_key,
307 in_dict_after_value,
Nigel Taod60815c2020-03-26 14:32:35 +1100308} g_ctx;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100309
Nigel Tao0cd2f982020-03-03 23:03:02 +1100310bool //
311in_dict_before_key() {
Nigel Taod60815c2020-03-26 14:32:35 +1100312 return (g_ctx == context::in_dict_after_brace) ||
313 (g_ctx == context::in_dict_after_value);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100314}
315
Nigel Taod60815c2020-03-26 14:32:35 +1100316uint32_t g_suppress_write_dst;
317bool g_wrote_to_dst;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100318
Nigel Taod60815c2020-03-26 14:32:35 +1100319wuffs_json__decoder g_dec;
Nigel Tao1b073492020-02-16 22:11:36 +1100320
Nigel Tao168f60a2020-07-14 13:19:33 +1000321// cbor_output_string_array is a 4 KiB buffer. For -output-format=cbor, strings
322// whose length are 4096 or less are written as a single definite-length
323// string. Longer strings are written as an indefinite-length string containing
324// multiple definite-length chunks, each of length up to 4 KiB. See the CBOR
325// RFC (RFC 7049) section 2.2.2 "Indefinite-Length Byte Strings and Text
326// Strings". The output is determinate even when the input is streamed.
327//
328// If raising CBOR_OUTPUT_STRING_ARRAY_SIZE above 0xFFFF then you will also
329// have to update flush_cbor_output_string.
330#define CBOR_OUTPUT_STRING_ARRAY_SIZE 4096
331uint8_t g_cbor_output_string_array[CBOR_OUTPUT_STRING_ARRAY_SIZE];
332
333uint32_t g_cbor_output_string_length;
334bool g_cbor_output_string_is_multiple_chunks;
335bool g_cbor_output_string_is_utf_8;
336
Nigel Tao0cd2f982020-03-03 23:03:02 +1100337// ----
338
339// Query is a JSON Pointer query. After initializing with a NUL-terminated C
340// string, its multiple fragments are consumed as the program walks the JSON
341// data from stdin. For example, letting "$" denote a NUL, suppose that we
342// started with a query string of "/apple/banana/12/durian" and are currently
Nigel Taob48ee752020-03-13 09:27:33 +1100343// trying to match the second fragment, "banana", so that Query::m_depth is 2:
Nigel Tao0cd2f982020-03-03 23:03:02 +1100344//
345// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
346// / a p p l e / b a n a n a / 1 2 / d u r i a n $
347// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
348// ^ ^
Nigel Taob48ee752020-03-13 09:27:33 +1100349// m_frag_i m_frag_k
Nigel Tao0cd2f982020-03-03 23:03:02 +1100350//
Nigel Taob48ee752020-03-13 09:27:33 +1100351// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
352// start (inclusive) and end (exclusive) of the query fragment. They satisfy
353// (mfi <= mfk) and may be equal if the fragment empty (note that "" is a valid
354// JSON object key).
Nigel Tao0cd2f982020-03-03 23:03:02 +1100355//
Nigel Taob48ee752020-03-13 09:27:33 +1100356// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
357// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
Nigel Tao0cd2f982020-03-03 23:03:02 +1100358//
359// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
360// tokens, as backslash-escaped values within that JSON string may each get
361// their own token.
362//
Nigel Taob48ee752020-03-13 09:27:33 +1100363// At the start of each object key (a JSON string), mfj is set to mfi.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100364//
Nigel Taob48ee752020-03-13 09:27:33 +1100365// While mfj remains non-nullptr, each token's unescaped contents are then
366// compared to that part of the fragment from mfj to mfk. If it is a prefix
367// (including the case of an exact match), then mfj is advanced by the
368// unescaped length. Otherwise, mfj is set to nullptr.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100369//
370// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
371// the query (not the JSON value) are unescaped to "~" and "/" respectively.
Nigel Taob48ee752020-03-13 09:27:33 +1100372// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
373// responsible for calling Query::validate (with a strict_json_pointer_syntax
374// argument) before otherwise using this class.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100375//
Nigel Taob48ee752020-03-13 09:27:33 +1100376// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
377// incrementally match the object key with the query fragment. For example, if
378// we have already matched the "ban" of "banana", then we would accept any of
379// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
380// advance mfj by 3, 1 or 1 bytes respectively.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100381//
Nigel Taob48ee752020-03-13 09:27:33 +1100382// mfj
Nigel Tao0cd2f982020-03-03 23:03:02 +1100383// v
384// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
385// / a p p l e / b a n a n a / 1 2 / d u r i a n $
386// +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
387// ^ ^
Nigel Taob48ee752020-03-13 09:27:33 +1100388// mfi mfk
Nigel Tao0cd2f982020-03-03 23:03:02 +1100389//
390// At the end of each object key (or equivalently, at the start of each object
Nigel Taob48ee752020-03-13 09:27:33 +1100391// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
392// have a fragment match: the query fragment equals the object key. If there is
393// a next fragment (in this example, "12") we move the frag_etc pointers to its
394// start and end and increment Query::m_depth. Otherwise, we have matched the
395// complete query, and the upcoming JSON value is the result of that query.
Nigel Tao0cd2f982020-03-03 23:03:02 +1100396//
397// The discussion above centers on object keys. If the query fragment is
398// numeric then it can also match as an array index: the string fragment "12"
399// will match an array's 13th element (starting counting from zero). See RFC
400// 6901 for its precise definition of an "array index" number.
401//
Nigel Taob48ee752020-03-13 09:27:33 +1100402// Array index fragment match is represented by the Query::m_array_index field,
Nigel Tao0cd2f982020-03-03 23:03:02 +1100403// whose type (wuffs_base__result_u64) is a result type. An error result means
404// that the fragment is not an array index. A value result holds the number of
405// list elements remaining. When matching a query fragment in an array (instead
406// of in an object), each element ticks this number down towards zero. At zero,
407// the upcoming JSON value is the one that matches the query fragment.
408class Query {
409 private:
Nigel Taob48ee752020-03-13 09:27:33 +1100410 uint8_t* m_frag_i;
411 uint8_t* m_frag_j;
412 uint8_t* m_frag_k;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100413
Nigel Taob48ee752020-03-13 09:27:33 +1100414 uint32_t m_depth;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100415
Nigel Taob48ee752020-03-13 09:27:33 +1100416 wuffs_base__result_u64 m_array_index;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100417
418 public:
419 void reset(char* query_c_string) {
Nigel Taob48ee752020-03-13 09:27:33 +1100420 m_frag_i = (uint8_t*)query_c_string;
421 m_frag_j = (uint8_t*)query_c_string;
422 m_frag_k = (uint8_t*)query_c_string;
423 m_depth = 0;
424 m_array_index.status.repr = "#main: not an array index query fragment";
425 m_array_index.value = 0;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100426 }
427
Nigel Taob48ee752020-03-13 09:27:33 +1100428 void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100429
Nigel Taob48ee752020-03-13 09:27:33 +1100430 bool is_at(uint32_t depth) { return m_depth == depth; }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100431
432 // tick returns whether the fragment is a valid array index whose value is
433 // zero. If valid but non-zero, it decrements it and returns false.
434 bool tick() {
Nigel Taob48ee752020-03-13 09:27:33 +1100435 if (m_array_index.status.is_ok()) {
436 if (m_array_index.value == 0) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100437 return true;
438 }
Nigel Taob48ee752020-03-13 09:27:33 +1100439 m_array_index.value--;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100440 }
441 return false;
442 }
443
444 // next_fragment moves to the next fragment, returning whether it existed.
445 bool next_fragment() {
Nigel Taob48ee752020-03-13 09:27:33 +1100446 uint8_t* k = m_frag_k;
447 uint32_t d = m_depth;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100448
449 this->reset(nullptr);
450
451 if (!k || (*k != '/')) {
452 return false;
453 }
454 k++;
455
456 bool all_digits = true;
457 uint8_t* i = k;
458 while ((*k != '\x00') && (*k != '/')) {
459 all_digits = all_digits && ('0' <= *k) && (*k <= '9');
460 k++;
461 }
Nigel Taob48ee752020-03-13 09:27:33 +1100462 m_frag_i = i;
463 m_frag_j = i;
464 m_frag_k = k;
465 m_depth = d + 1;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100466 if (all_digits) {
467 // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
Nigel Tao6b7ce302020-07-07 16:19:46 +1000468 m_array_index = wuffs_base__parse_number_u64(
469 wuffs_base__make_slice_u8(i, k - i),
470 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100471 }
472 return true;
473 }
474
Nigel Taob48ee752020-03-13 09:27:33 +1100475 bool matched_all() { return m_frag_k == nullptr; }
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100476
Nigel Taob48ee752020-03-13 09:27:33 +1100477 bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100478
479 void incremental_match_slice(uint8_t* ptr, size_t len) {
Nigel Taob48ee752020-03-13 09:27:33 +1100480 if (!m_frag_j) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100481 return;
482 }
Nigel Taob48ee752020-03-13 09:27:33 +1100483 uint8_t* j = m_frag_j;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100484 while (true) {
485 if (len == 0) {
Nigel Taob48ee752020-03-13 09:27:33 +1100486 m_frag_j = j;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100487 return;
488 }
489
490 if (*j == '\x00') {
491 break;
492
493 } else if (*j == '~') {
494 j++;
495 if (*j == '0') {
496 if (*ptr != '~') {
497 break;
498 }
499 } else if (*j == '1') {
500 if (*ptr != '/') {
501 break;
502 }
Nigel Taod6fdfb12020-03-11 12:24:14 +1100503 } else if (*j == 'n') {
504 if (*ptr != '\n') {
505 break;
506 }
507 } else if (*j == 'r') {
508 if (*ptr != '\r') {
509 break;
510 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100511 } else {
512 break;
513 }
514
515 } else if (*j != *ptr) {
516 break;
517 }
518
519 j++;
520 ptr++;
521 len--;
522 }
Nigel Taob48ee752020-03-13 09:27:33 +1100523 m_frag_j = nullptr;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100524 }
525
526 void incremental_match_code_point(uint32_t code_point) {
Nigel Taob48ee752020-03-13 09:27:33 +1100527 if (!m_frag_j) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100528 return;
529 }
530 uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
531 size_t n = wuffs_base__utf_8__encode(
532 wuffs_base__make_slice_u8(&u[0],
533 WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
534 code_point);
535 if (n > 0) {
536 this->incremental_match_slice(&u[0], n);
537 }
538 }
539
540 // validate returns whether the (ptr, len) arguments form a valid JSON
541 // Pointer. In particular, it must be valid UTF-8, and either be empty or
542 // start with a '/'. Any '~' within must immediately be followed by either
Nigel Taod6fdfb12020-03-11 12:24:14 +1100543 // '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also be
544 // followed by either 'n' or 'r'.
545 static bool validate(char* query_c_string,
546 size_t length,
547 bool strict_json_pointer_syntax) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100548 if (length <= 0) {
549 return true;
550 }
551 if (query_c_string[0] != '/') {
552 return false;
553 }
554 wuffs_base__slice_u8 s =
555 wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
556 bool previous_was_tilde = false;
557 while (s.len > 0) {
558 wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
559 if (!o.is_valid()) {
560 return false;
561 }
Nigel Taod6fdfb12020-03-11 12:24:14 +1100562
563 if (previous_was_tilde) {
564 switch (o.code_point) {
565 case '0':
566 case '1':
567 break;
568 case 'n':
569 case 'r':
570 if (strict_json_pointer_syntax) {
571 return false;
572 }
573 break;
574 default:
575 return false;
576 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100577 }
578 previous_was_tilde = o.code_point == '~';
Nigel Taod6fdfb12020-03-11 12:24:14 +1100579
Nigel Tao0cd2f982020-03-03 23:03:02 +1100580 s.ptr += o.byte_length;
581 s.len -= o.byte_length;
582 }
583 return !previous_was_tilde;
584 }
Nigel Taod60815c2020-03-26 14:32:35 +1100585} g_query;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100586
587// ----
588
Nigel Tao168f60a2020-07-14 13:19:33 +1000589enum class file_format {
590 json,
591 cbor,
592};
593
Nigel Tao68920952020-03-03 11:25:18 +1100594struct {
595 int remaining_argc;
596 char** remaining_argv;
597
Nigel Tao3690e832020-03-12 16:52:26 +1100598 bool compact_output;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100599 bool fail_if_unsandboxed;
Nigel Taoc766bb72020-07-09 12:59:32 +1000600 bool input_json_extra_comma;
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100601 uint32_t max_output_depth;
Nigel Tao168f60a2020-07-14 13:19:33 +1000602 file_format output_format;
Nigel Taoc766bb72020-07-09 12:59:32 +1000603 bool output_json_extra_comma;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100604 char* query_c_string;
Nigel Taoecadf722020-07-13 08:22:34 +1000605 size_t spaces;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100606 bool strict_json_pointer_syntax;
Nigel Tao68920952020-03-03 11:25:18 +1100607 bool tabs;
Nigel Taod60815c2020-03-26 14:32:35 +1100608} g_flags = {0};
Nigel Tao68920952020-03-03 11:25:18 +1100609
610const char* //
611parse_flags(int argc, char** argv) {
Nigel Taoecadf722020-07-13 08:22:34 +1000612 g_flags.spaces = 4;
Nigel Taod60815c2020-03-26 14:32:35 +1100613 g_flags.max_output_depth = 0xFFFFFFFF;
Nigel Tao68920952020-03-03 11:25:18 +1100614
615 int c = (argc > 0) ? 1 : 0; // Skip argv[0], the program name.
616 for (; c < argc; c++) {
617 char* arg = argv[c];
618 if (*arg++ != '-') {
619 break;
620 }
621
622 // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
623 // cases, a bare "-" is not a flag (some programs may interpret it as
624 // stdin) and a bare "--" means to stop parsing flags.
625 if (*arg == '\x00') {
626 break;
627 } else if (*arg == '-') {
628 arg++;
629 if (*arg == '\x00') {
630 c++;
631 break;
632 }
633 }
634
Nigel Tao3690e832020-03-12 16:52:26 +1100635 if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100636 g_flags.compact_output = true;
Nigel Tao68920952020-03-03 11:25:18 +1100637 continue;
638 }
Nigel Tao94440cf2020-04-02 22:28:24 +1100639 if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
640 g_flags.max_output_depth = 1;
641 continue;
642 } else if (!strncmp(arg, "d=", 2) ||
643 !strncmp(arg, "max-output-depth=", 16)) {
644 while (*arg++ != '=') {
645 }
646 wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
Nigel Tao6b7ce302020-07-07 16:19:46 +1000647 wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
648 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
Nigel Tao94440cf2020-04-02 22:28:24 +1100649 if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
650 g_flags.max_output_depth = (uint32_t)(u.value);
651 continue;
652 }
653 return g_usage;
654 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100655 if (!strcmp(arg, "fail-if-unsandboxed")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100656 g_flags.fail_if_unsandboxed = true;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100657 continue;
658 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000659 if (!strcmp(arg, "input-json-extra-comma")) {
660 g_flags.input_json_extra_comma = true;
661 continue;
662 }
Nigel Tao168f60a2020-07-14 13:19:33 +1000663 if (!strcmp(arg, "o=cbor") || !strcmp(arg, "output-format=cbor")) {
664 g_flags.output_format = file_format::cbor;
665 continue;
666 }
667 if (!strcmp(arg, "o=json") || !strcmp(arg, "output-format=json")) {
668 g_flags.output_format = file_format::json;
669 continue;
670 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000671 if (!strcmp(arg, "output-json-extra-comma")) {
672 g_flags.output_json_extra_comma = true;
673 continue;
674 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100675 if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
676 while (*arg++ != '=') {
677 }
Nigel Taod60815c2020-03-26 14:32:35 +1100678 g_flags.query_c_string = arg;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100679 continue;
680 }
Nigel Taoecadf722020-07-13 08:22:34 +1000681 if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
682 while (*arg++ != '=') {
683 }
684 if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
685 g_flags.spaces = arg[0] - '0';
686 continue;
687 }
688 return g_usage;
689 }
690 if (!strcmp(arg, "strict-json-pointer-syntax")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100691 g_flags.strict_json_pointer_syntax = true;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100692 continue;
Nigel Tao68920952020-03-03 11:25:18 +1100693 }
694 if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100695 g_flags.tabs = true;
Nigel Tao68920952020-03-03 11:25:18 +1100696 continue;
697 }
698
Nigel Taod60815c2020-03-26 14:32:35 +1100699 return g_usage;
Nigel Tao68920952020-03-03 11:25:18 +1100700 }
701
Nigel Taod60815c2020-03-26 14:32:35 +1100702 if (g_flags.query_c_string &&
703 !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
704 g_flags.strict_json_pointer_syntax)) {
Nigel Taod6fdfb12020-03-11 12:24:14 +1100705 return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
706 }
707
Nigel Taod60815c2020-03-26 14:32:35 +1100708 g_flags.remaining_argc = argc - c;
709 g_flags.remaining_argv = argv + c;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100710 return nullptr;
Nigel Tao68920952020-03-03 11:25:18 +1100711}
712
Nigel Tao2cf76db2020-02-27 22:42:01 +1100713const char* //
714initialize_globals(int argc, char** argv) {
Nigel Taod60815c2020-03-26 14:32:35 +1100715 g_dst = wuffs_base__make_io_buffer(
716 wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100717 wuffs_base__empty_io_buffer_meta());
Nigel Tao1b073492020-02-16 22:11:36 +1100718
Nigel Taod60815c2020-03-26 14:32:35 +1100719 g_src = wuffs_base__make_io_buffer(
720 wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100721 wuffs_base__empty_io_buffer_meta());
722
Nigel Taod60815c2020-03-26 14:32:35 +1100723 g_tok = wuffs_base__make_token_buffer(
724 wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100725 wuffs_base__empty_token_buffer_meta());
726
Nigel Taod60815c2020-03-26 14:32:35 +1100727 g_curr_token_end_src_index = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100728
Nigel Taod60815c2020-03-26 14:32:35 +1100729 g_depth = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100730
Nigel Taod60815c2020-03-26 14:32:35 +1100731 g_ctx = context::none;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100732
Nigel Tao68920952020-03-03 11:25:18 +1100733 TRY(parse_flags(argc, argv));
Nigel Taod60815c2020-03-26 14:32:35 +1100734 if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100735 return "main: unsandboxed";
736 }
Nigel Tao01abc842020-03-06 21:42:33 +1100737 const int stdin_fd = 0;
Nigel Taod60815c2020-03-26 14:32:35 +1100738 if (g_flags.remaining_argc >
739 ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
740 return g_usage;
Nigel Tao107f0ef2020-03-01 21:35:02 +1100741 }
742
Nigel Taod60815c2020-03-26 14:32:35 +1100743 g_query.reset(g_flags.query_c_string);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100744
745 // If the query is non-empty, suprress writing to stdout until we've
746 // completed the query.
Nigel Taod60815c2020-03-26 14:32:35 +1100747 g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
748 g_wrote_to_dst = false;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100749
Nigel Taod60815c2020-03-26 14:32:35 +1100750 TRY(g_dec.initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
Nigel Tao4b186b02020-03-18 14:25:21 +1100751 .message());
752
Nigel Taoc766bb72020-07-09 12:59:32 +1000753 if (g_flags.input_json_extra_comma) {
754 g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
755 }
756
Nigel Tao4b186b02020-03-18 14:25:21 +1100757 // Consume an optional whitespace trailer. This isn't part of the JSON spec,
758 // but it works better with line oriented Unix tools (such as "echo 123 |
759 // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
760 // can accidentally contain trailing whitespace.
Nigel Taod60815c2020-03-26 14:32:35 +1100761 g_dec.set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);
Nigel Tao4b186b02020-03-18 14:25:21 +1100762
763 return nullptr;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100764}
Nigel Tao1b073492020-02-16 22:11:36 +1100765
766// ----
767
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100768// ignore_return_value suppresses errors from -Wall -Werror.
769static void //
770ignore_return_value(int ignored) {}
771
Nigel Tao2914bae2020-02-26 09:40:30 +1100772const char* //
773read_src() {
Nigel Taod60815c2020-03-26 14:32:35 +1100774 if (g_src.meta.closed) {
Nigel Tao9cc2c252020-02-23 17:05:49 +1100775 return "main: internal error: read requested on a closed source";
Nigel Taoa8406922020-02-19 12:22:00 +1100776 }
Nigel Taod60815c2020-03-26 14:32:35 +1100777 g_src.compact();
778 if (g_src.meta.wi >= g_src.data.len) {
779 return "main: g_src buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100780 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100781 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100782 ssize_t n = read(g_input_file_descriptor, g_src.data.ptr + g_src.meta.wi,
783 g_src.data.len - g_src.meta.wi);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100784 if (n >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100785 g_src.meta.wi += n;
786 g_src.meta.closed = n == 0;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100787 break;
788 } else if (errno != EINTR) {
789 return strerror(errno);
790 }
Nigel Tao1b073492020-02-16 22:11:36 +1100791 }
792 return nullptr;
793}
794
Nigel Tao2914bae2020-02-26 09:40:30 +1100795const char* //
796flush_dst() {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100797 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100798 size_t n = g_dst.meta.wi - g_dst.meta.ri;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100799 if (n == 0) {
800 break;
Nigel Tao1b073492020-02-16 22:11:36 +1100801 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100802 const int stdout_fd = 1;
Nigel Taod60815c2020-03-26 14:32:35 +1100803 ssize_t i = write(stdout_fd, g_dst.data.ptr + g_dst.meta.ri, n);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100804 if (i >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100805 g_dst.meta.ri += i;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100806 } else if (errno != EINTR) {
807 return strerror(errno);
808 }
Nigel Tao1b073492020-02-16 22:11:36 +1100809 }
Nigel Taod60815c2020-03-26 14:32:35 +1100810 g_dst.compact();
Nigel Tao1b073492020-02-16 22:11:36 +1100811 return nullptr;
812}
813
Nigel Tao2914bae2020-02-26 09:40:30 +1100814const char* //
815write_dst(const void* s, size_t n) {
Nigel Taod60815c2020-03-26 14:32:35 +1100816 if (g_suppress_write_dst > 0) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100817 return nullptr;
818 }
Nigel Tao1b073492020-02-16 22:11:36 +1100819 const uint8_t* p = static_cast<const uint8_t*>(s);
820 while (n > 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100821 size_t i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100822 if (i == 0) {
823 const char* z = flush_dst();
824 if (z) {
825 return z;
826 }
Nigel Taod60815c2020-03-26 14:32:35 +1100827 i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100828 if (i == 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100829 return "main: g_dst buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100830 }
831 }
832
833 if (i > n) {
834 i = n;
835 }
Nigel Taod60815c2020-03-26 14:32:35 +1100836 memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
837 g_dst.meta.wi += i;
Nigel Tao1b073492020-02-16 22:11:36 +1100838 p += i;
839 n -= i;
Nigel Taod60815c2020-03-26 14:32:35 +1100840 g_wrote_to_dst = true;
Nigel Tao1b073492020-02-16 22:11:36 +1100841 }
842 return nullptr;
843}
844
845// ----
846
Nigel Tao168f60a2020-07-14 13:19:33 +1000847const char* //
848write_literal(uint64_t vbd) {
849 const char* ptr = nullptr;
850 size_t len = 0;
851 if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__UNDEFINED) {
852 if (g_flags.output_format == file_format::json) {
853 ptr = "null"; // JSON's closest approximation to "undefined".
854 len = 4;
855 } else {
856 ptr = "\xF7";
857 len = 1;
858 }
859 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__NULL) {
860 if (g_flags.output_format == file_format::json) {
861 ptr = "null";
862 len = 4;
863 } else {
864 ptr = "\xF6";
865 len = 1;
866 }
867 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__FALSE) {
868 if (g_flags.output_format == file_format::json) {
869 ptr = "false";
870 len = 5;
871 } else {
872 ptr = "\xF4";
873 len = 1;
874 }
875 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__TRUE) {
876 if (g_flags.output_format == file_format::json) {
877 ptr = "true";
878 len = 4;
879 } else {
880 ptr = "\xF5";
881 len = 1;
882 }
883 } else {
884 return "main: internal error: unexpected write_literal argument";
885 }
886 return write_dst(ptr, len);
887}
888
889// ----
890
891const char* //
892write_number_cbor_f64(double f) {
893 uint8_t buf[9];
894 wuffs_base__lossy_value_u16 lv16 =
895 wuffs_base__ieee_754_bit_representation__from_f64_to_u16_truncate(f);
896 if (!lv16.lossy) {
897 buf[0] = 0xF9;
898 wuffs_base__store_u16be__no_bounds_check(&buf[1], lv16.value);
899 return write_dst(&buf[0], 3);
900 }
901 wuffs_base__lossy_value_u32 lv32 =
902 wuffs_base__ieee_754_bit_representation__from_f64_to_u32_truncate(f);
903 if (!lv32.lossy) {
904 buf[0] = 0xFA;
905 wuffs_base__store_u32be__no_bounds_check(&buf[1], lv32.value);
906 return write_dst(&buf[0], 5);
907 }
908 buf[0] = 0xFB;
909 wuffs_base__store_u64be__no_bounds_check(
910 &buf[1], wuffs_base__ieee_754_bit_representation__from_f64_to_u64(f));
911 return write_dst(&buf[0], 9);
912}
913
914const char* //
915write_number_cbor_u64(uint8_t base, uint64_t u) {
916 uint8_t buf[9];
917 if (u < 0x18) {
918 buf[0] = base | ((uint8_t)u);
919 return write_dst(&buf[0], 1);
920 } else if ((u >> 8) == 0) {
921 buf[0] = base | 0x18;
922 buf[1] = ((uint8_t)u);
923 return write_dst(&buf[0], 2);
924 } else if ((u >> 16) == 0) {
925 buf[0] = base | 0x19;
926 wuffs_base__store_u16be__no_bounds_check(&buf[1], ((uint16_t)u));
927 return write_dst(&buf[0], 3);
928 } else if ((u >> 32) == 0) {
929 buf[0] = base | 0x1A;
930 wuffs_base__store_u32be__no_bounds_check(&buf[1], ((uint32_t)u));
931 return write_dst(&buf[0], 5);
932 }
933 buf[0] = base | 0x1B;
934 wuffs_base__store_u64be__no_bounds_check(&buf[1], u);
935 return write_dst(&buf[0], 9);
936}
937
938const char* //
939write_number(uint64_t vbd, uint8_t* ptr, size_t len) {
940 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_TEXT) {
941 if (g_flags.output_format == file_format::json) {
942 return write_dst(ptr, len);
943 }
944
945 // First try to parse (ptr, len) as an integer. Something like
946 // "1180591620717411303424" is a valid number (in the JSON sense) but will
947 // overflow int64_t or uint64_t, so fall back to parsing it as a float64.
948 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_INTEGER_SIGNED) {
949 if ((len > 0) && (ptr[0] == '-')) {
950 wuffs_base__result_i64 ri = wuffs_base__parse_number_i64(
951 wuffs_base__make_slice_u8(ptr, len),
952 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
953 if (ri.status.is_ok()) {
954 return write_number_cbor_u64(0x20, ~ri.value);
955 }
956 } else {
957 wuffs_base__result_u64 ru = wuffs_base__parse_number_u64(
958 wuffs_base__make_slice_u8(ptr, len),
959 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
960 if (ru.status.is_ok()) {
961 return write_number_cbor_u64(0x00, ru.value);
962 }
963 }
964 }
965
966 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_FLOATING_POINT) {
967 wuffs_base__result_f64 rf = wuffs_base__parse_number_f64(
968 wuffs_base__make_slice_u8(ptr, len),
969 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
970 if (rf.status.is_ok()) {
971 return write_number_cbor_f64(rf.value);
972 }
973 }
974 }
975
976 return "main: internal error: unexpected write_number argument";
977}
978
979// ----
980
Nigel Tao2914bae2020-02-26 09:40:30 +1100981uint8_t //
982hex_digit(uint8_t nibble) {
Nigel Taob5461bd2020-02-21 14:13:37 +1100983 nibble &= 0x0F;
984 if (nibble <= 9) {
985 return '0' + nibble;
986 }
987 return ('A' - 10) + nibble;
988}
989
Nigel Tao2914bae2020-02-26 09:40:30 +1100990const char* //
Nigel Tao168f60a2020-07-14 13:19:33 +1000991flush_cbor_output_string() {
992 uint8_t prefix[3];
993 prefix[0] = g_cbor_output_string_is_utf_8 ? 0x60 : 0x40;
994 if (g_cbor_output_string_length < 0x18) {
995 prefix[0] |= g_cbor_output_string_length;
996 TRY(write_dst(&prefix[0], 1));
997 } else if (g_cbor_output_string_length <= 0xFF) {
998 prefix[0] |= 0x18;
999 prefix[1] = g_cbor_output_string_length;
1000 TRY(write_dst(&prefix[0], 2));
1001 } else if (g_cbor_output_string_length <= 0xFFFF) {
1002 prefix[0] |= 0x19;
1003 prefix[1] = g_cbor_output_string_length >> 8;
1004 prefix[2] = g_cbor_output_string_length;
1005 TRY(write_dst(&prefix[0], 3));
1006 } else {
1007 return "main: internal error: CBOR string output is too long";
1008 }
1009
1010 size_t n = g_cbor_output_string_length;
1011 g_cbor_output_string_length = 0;
1012 return write_dst(&g_cbor_output_string_array[0], n);
1013}
1014
1015const char* //
1016write_cbor_output_string(uint8_t* ptr, size_t len, bool finish) {
1017 // Check that g_cbor_output_string_array can hold any UTF-8 code point.
1018 if (CBOR_OUTPUT_STRING_ARRAY_SIZE < 4) {
1019 return "main: internal error: CBOR_OUTPUT_STRING_ARRAY_SIZE is too short";
1020 }
1021
1022 while (len > 0) {
1023 size_t available =
1024 CBOR_OUTPUT_STRING_ARRAY_SIZE - g_cbor_output_string_length;
1025 if (available >= len) {
1026 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1027 len);
1028 g_cbor_output_string_length += len;
1029 ptr += len;
1030 len = 0;
1031 break;
1032
1033 } else if (available > 0) {
1034 if (!g_cbor_output_string_is_multiple_chunks) {
1035 g_cbor_output_string_is_multiple_chunks = true;
1036 TRY(write_dst(g_cbor_output_string_is_utf_8 ? "\x7F" : "\x5F", 1));
Nigel Tao3b486982020-02-27 15:05:59 +11001037 }
Nigel Tao168f60a2020-07-14 13:19:33 +10001038
1039 if (g_cbor_output_string_is_utf_8) {
1040 // Walk the end backwards to a UTF-8 boundary, so that each chunk of
1041 // the multi-chunk string is also valid UTF-8.
1042 while (available > 0) {
1043 wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next_from_end(
1044 wuffs_base__make_slice_u8(ptr, available));
1045 if ((o.code_point != WUFFS_BASE__UNICODE_REPLACEMENT_CHARACTER) ||
1046 (o.byte_length != 1)) {
1047 break;
1048 }
1049 available--;
1050 }
1051 }
1052
1053 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1054 available);
1055 g_cbor_output_string_length += available;
1056 ptr += available;
1057 len -= available;
Nigel Tao3b486982020-02-27 15:05:59 +11001058 }
1059
Nigel Tao168f60a2020-07-14 13:19:33 +10001060 TRY(flush_cbor_output_string());
1061 }
Nigel Taob9ad34f2020-03-03 12:44:01 +11001062
Nigel Tao168f60a2020-07-14 13:19:33 +10001063 if (finish) {
1064 TRY(flush_cbor_output_string());
1065 if (g_cbor_output_string_is_multiple_chunks) {
1066 TRY(write_dst("\xFF", 1));
1067 }
1068 }
1069 return nullptr;
1070}
Nigel Taob9ad34f2020-03-03 12:44:01 +11001071
Nigel Tao168f60a2020-07-14 13:19:33 +10001072const char* //
1073handle_string(uint64_t vbd,
1074 uint64_t len,
1075 bool start_of_token_chain,
1076 bool continued) {
1077 if (start_of_token_chain) {
1078 if (g_flags.output_format == file_format::json) {
1079 TRY(write_dst("\"", 1));
1080 } else {
1081 g_cbor_output_string_length = 0;
1082 g_cbor_output_string_is_multiple_chunks = false;
1083 g_cbor_output_string_is_utf_8 =
1084 vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8;
1085 }
1086 g_query.restart_fragment(in_dict_before_key() && g_query.is_at(g_depth));
1087 }
1088
1089 if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
1090 // No-op.
1091 } else if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
1092 uint8_t* ptr = g_src.data.ptr + g_curr_token_end_src_index - len;
1093 if (g_flags.output_format == file_format::json) {
1094 // TODO: if the input is CBOR but the output is JSON then we have to
1095 // escape '\n', '\"', etc.
1096 TRY(write_dst(ptr, len));
1097 } else {
1098 TRY(write_cbor_output_string(ptr, len, false));
1099 }
1100 g_query.incremental_match_slice(ptr, len);
Nigel Taob9ad34f2020-03-03 12:44:01 +11001101 } else {
Nigel Tao168f60a2020-07-14 13:19:33 +10001102 return "main: internal error: unexpected string-token conversion";
1103 }
1104
1105 if (continued) {
1106 return nullptr;
1107 }
1108
1109 if (g_flags.output_format == file_format::json) {
1110 TRY(write_dst("\"", 1));
1111 } else {
1112 TRY(write_cbor_output_string(nullptr, 0, true));
1113 }
1114 return nullptr;
1115}
1116
1117const char* //
1118handle_unicode_code_point(uint32_t ucp) {
1119 if (g_flags.output_format == file_format::json) {
1120 if (ucp < 0x0020) {
1121 switch (ucp) {
1122 case '\b':
1123 return write_dst("\\b", 2);
1124 case '\f':
1125 return write_dst("\\f", 2);
1126 case '\n':
1127 return write_dst("\\n", 2);
1128 case '\r':
1129 return write_dst("\\r", 2);
1130 case '\t':
1131 return write_dst("\\t", 2);
1132 }
1133
1134 // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
1135 // JSON string. They need to remain escaped.
1136 uint8_t esc6[6];
1137 esc6[0] = '\\';
1138 esc6[1] = 'u';
1139 esc6[2] = '0';
1140 esc6[3] = '0';
1141 esc6[4] = hex_digit(ucp >> 4);
1142 esc6[5] = hex_digit(ucp >> 0);
1143 return write_dst(&esc6[0], 6);
1144
1145 } else if (ucp == '\"') {
1146 return write_dst("\\\"", 2);
1147
1148 } else if (ucp == '\\') {
1149 return write_dst("\\\\", 2);
Nigel Tao3b486982020-02-27 15:05:59 +11001150 }
Nigel Tao3b486982020-02-27 15:05:59 +11001151 }
1152
Nigel Tao168f60a2020-07-14 13:19:33 +10001153 uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
1154 size_t n = wuffs_base__utf_8__encode(
1155 wuffs_base__make_slice_u8(&u[0],
1156 WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
1157 ucp);
1158 if (n == 0) {
1159 return "main: internal error: unexpected Unicode code point";
1160 }
1161
1162 if (g_flags.output_format == file_format::json) {
1163 return write_dst(&u[0], n);
1164 }
1165 return write_cbor_output_string(&u[0], n, false);
Nigel Tao3b486982020-02-27 15:05:59 +11001166}
1167
1168const char* //
Nigel Tao2ef39992020-04-09 17:24:39 +10001169handle_token(wuffs_base__token t, bool start_of_token_chain) {
Nigel Tao2cf76db2020-02-27 22:42:01 +11001170 do {
Nigel Tao462f8662020-04-01 23:01:51 +11001171 int64_t vbc = t.value_base_category();
Nigel Tao2cf76db2020-02-27 22:42:01 +11001172 uint64_t vbd = t.value_base_detail();
1173 uint64_t len = t.length();
Nigel Tao1b073492020-02-16 22:11:36 +11001174
1175 // Handle ']' or '}'.
Nigel Tao9f7a2502020-02-23 09:42:02 +11001176 if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
Nigel Tao2cf76db2020-02-27 22:42:01 +11001177 (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
Nigel Taod60815c2020-03-26 14:32:35 +11001178 if (g_query.is_at(g_depth)) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001179 return "main: no match for query";
1180 }
Nigel Taod60815c2020-03-26 14:32:35 +11001181 if (g_depth <= 0) {
1182 return "main: internal error: inconsistent g_depth";
Nigel Tao1b073492020-02-16 22:11:36 +11001183 }
Nigel Taod60815c2020-03-26 14:32:35 +11001184 g_depth--;
Nigel Tao1b073492020-02-16 22:11:36 +11001185
Nigel Taod60815c2020-03-26 14:32:35 +11001186 if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
1187 g_suppress_write_dst--;
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001188 // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
Nigel Tao168f60a2020-07-14 13:19:33 +10001189 if (g_flags.output_format == file_format::json) {
1190 TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
1191 ? "\"[…]\""
1192 : "\"{…}\"",
1193 7));
1194 } else {
1195 TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
1196 ? "\x65[…]"
1197 : "\x65{…}",
1198 6));
1199 }
1200 } else if (g_flags.output_format == file_format::json) {
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001201 // Write preceding whitespace.
Nigel Taod60815c2020-03-26 14:32:35 +11001202 if ((g_ctx != context::in_list_after_bracket) &&
1203 (g_ctx != context::in_dict_after_brace) &&
1204 !g_flags.compact_output) {
Nigel Taoc766bb72020-07-09 12:59:32 +10001205 if (g_flags.output_json_extra_comma) {
1206 TRY(write_dst(",\n", 2));
1207 } else {
1208 TRY(write_dst("\n", 1));
1209 }
Nigel Taod60815c2020-03-26 14:32:35 +11001210 for (uint32_t i = 0; i < g_depth; i++) {
1211 TRY(write_dst(
1212 g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
Nigel Taoecadf722020-07-13 08:22:34 +10001213 g_flags.tabs ? 1 : g_flags.spaces));
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001214 }
Nigel Tao1b073492020-02-16 22:11:36 +11001215 }
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001216
1217 TRY(write_dst(
1218 (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
1219 1));
Nigel Tao168f60a2020-07-14 13:19:33 +10001220 } else {
1221 TRY(write_dst("\xFF", 1));
Nigel Tao1b073492020-02-16 22:11:36 +11001222 }
1223
Nigel Taod60815c2020-03-26 14:32:35 +11001224 g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
1225 ? context::in_list_after_value
1226 : context::in_dict_after_key;
Nigel Tao1b073492020-02-16 22:11:36 +11001227 goto after_value;
1228 }
1229
Nigel Taod1c928a2020-02-28 12:43:53 +11001230 // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
1231 // continuation of a multi-token chain.
Nigel Tao2ef39992020-04-09 17:24:39 +10001232 if (start_of_token_chain) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001233 if (g_flags.output_format != file_format::json) {
1234 // No-op.
1235 } else if (g_ctx == context::in_dict_after_key) {
Nigel Taod60815c2020-03-26 14:32:35 +11001236 TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
1237 } else if (g_ctx != context::none) {
1238 if ((g_ctx != context::in_list_after_bracket) &&
1239 (g_ctx != context::in_dict_after_brace)) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001240 TRY(write_dst(",", 1));
Nigel Tao107f0ef2020-03-01 21:35:02 +11001241 }
Nigel Taod60815c2020-03-26 14:32:35 +11001242 if (!g_flags.compact_output) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001243 TRY(write_dst("\n", 1));
Nigel Taod60815c2020-03-26 14:32:35 +11001244 for (size_t i = 0; i < g_depth; i++) {
1245 TRY(write_dst(
1246 g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
Nigel Taoecadf722020-07-13 08:22:34 +10001247 g_flags.tabs ? 1 : g_flags.spaces));
Nigel Tao0cd2f982020-03-03 23:03:02 +11001248 }
1249 }
1250 }
1251
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001252 bool query_matched_fragment = false;
Nigel Taod60815c2020-03-26 14:32:35 +11001253 if (g_query.is_at(g_depth)) {
1254 switch (g_ctx) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001255 case context::in_list_after_bracket:
1256 case context::in_list_after_value:
Nigel Taod60815c2020-03-26 14:32:35 +11001257 query_matched_fragment = g_query.tick();
Nigel Tao0cd2f982020-03-03 23:03:02 +11001258 break;
1259 case context::in_dict_after_key:
Nigel Taod60815c2020-03-26 14:32:35 +11001260 query_matched_fragment = g_query.matched_fragment();
Nigel Tao0cd2f982020-03-03 23:03:02 +11001261 break;
Nigel Tao18ef5b42020-03-16 10:37:47 +11001262 default:
1263 break;
Nigel Tao0cd2f982020-03-03 23:03:02 +11001264 }
1265 }
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001266 if (!query_matched_fragment) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001267 // No-op.
Nigel Taod60815c2020-03-26 14:32:35 +11001268 } else if (!g_query.next_fragment()) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001269 // There is no next fragment. We have matched the complete query, and
1270 // the upcoming JSON value is the result of that query.
1271 //
Nigel Taod60815c2020-03-26 14:32:35 +11001272 // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
1273 // we were about to decode a top-level value. This makes any subsequent
1274 // indentation be relative to this point, and we will return g_eod
1275 // after the upcoming JSON value is complete.
1276 if (g_suppress_write_dst != 1) {
1277 return "main: internal error: inconsistent g_suppress_write_dst";
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001278 }
Nigel Taod60815c2020-03-26 14:32:35 +11001279 g_suppress_write_dst = 0;
1280 g_ctx = context::none;
1281 g_depth = 0;
Nigel Tao0cd2f982020-03-03 23:03:02 +11001282 } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
1283 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
1284 // The query has moved on to the next fragment but the upcoming JSON
1285 // value is not a container.
1286 return "main: no match for query";
Nigel Tao1b073492020-02-16 22:11:36 +11001287 }
1288 }
1289
1290 // Handle the token itself: either a container ('[' or '{') or a simple
Nigel Tao85fba7f2020-02-29 16:28:06 +11001291 // value: string (a chain of raw or escaped parts), literal or number.
Nigel Tao1b073492020-02-16 22:11:36 +11001292 switch (vbc) {
Nigel Tao85fba7f2020-02-29 16:28:06 +11001293 case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
Nigel Taod60815c2020-03-26 14:32:35 +11001294 if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
1295 g_suppress_write_dst++;
Nigel Tao168f60a2020-07-14 13:19:33 +10001296 } else if (g_flags.output_format == file_format::json) {
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001297 TRY(write_dst(
1298 (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
1299 1));
Nigel Tao168f60a2020-07-14 13:19:33 +10001300 } else {
1301 TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
1302 ? "\x9F"
1303 : "\xBF",
1304 1));
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001305 }
Nigel Taod60815c2020-03-26 14:32:35 +11001306 g_depth++;
1307 g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
1308 ? context::in_list_after_bracket
1309 : context::in_dict_after_brace;
Nigel Tao85fba7f2020-02-29 16:28:06 +11001310 return nullptr;
1311
Nigel Tao2cf76db2020-02-27 22:42:01 +11001312 case WUFFS_BASE__TOKEN__VBC__STRING:
Nigel Tao168f60a2020-07-14 13:19:33 +10001313 TRY(handle_string(vbd, len, start_of_token_chain, t.continued()));
Nigel Tao496e88b2020-04-09 22:10:08 +10001314 if (t.continued()) {
Nigel Tao2cf76db2020-02-27 22:42:01 +11001315 return nullptr;
1316 }
Nigel Tao2cf76db2020-02-27 22:42:01 +11001317 goto after_value;
1318
1319 case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
Nigel Tao496e88b2020-04-09 22:10:08 +10001320 if (!t.continued()) {
1321 return "main: internal error: unexpected non-continued UCP token";
Nigel Tao0cd2f982020-03-03 23:03:02 +11001322 }
1323 TRY(handle_unicode_code_point(vbd));
Nigel Taod60815c2020-03-26 14:32:35 +11001324 g_query.incremental_match_code_point(vbd);
Nigel Tao0cd2f982020-03-03 23:03:02 +11001325 return nullptr;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001326
Nigel Tao85fba7f2020-02-29 16:28:06 +11001327 case WUFFS_BASE__TOKEN__VBC__LITERAL:
Nigel Tao168f60a2020-07-14 13:19:33 +10001328 TRY(write_literal(vbd));
1329 goto after_value;
1330
Nigel Tao2cf76db2020-02-27 22:42:01 +11001331 case WUFFS_BASE__TOKEN__VBC__NUMBER:
Nigel Tao168f60a2020-07-14 13:19:33 +10001332 TRY(write_number(vbd, g_src.data.ptr + g_curr_token_end_src_index - len,
1333 len));
Nigel Tao2cf76db2020-02-27 22:42:01 +11001334 goto after_value;
Nigel Tao1b073492020-02-16 22:11:36 +11001335 }
1336
1337 // Return an error if we didn't match the (vbc, vbd) pair.
Nigel Tao2cf76db2020-02-27 22:42:01 +11001338 return "main: internal error: unexpected token";
1339 } while (0);
Nigel Tao1b073492020-02-16 22:11:36 +11001340
Nigel Tao2cf76db2020-02-27 22:42:01 +11001341 // Book-keeping after completing a value (whether a container value or a
1342 // simple value). Empty parent containers are no longer empty. If the parent
1343 // container is a "{...}" object, toggle between keys and values.
1344after_value:
Nigel Taod60815c2020-03-26 14:32:35 +11001345 if (g_depth == 0) {
1346 return g_eod;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001347 }
Nigel Taod60815c2020-03-26 14:32:35 +11001348 switch (g_ctx) {
Nigel Tao2cf76db2020-02-27 22:42:01 +11001349 case context::in_list_after_bracket:
Nigel Taod60815c2020-03-26 14:32:35 +11001350 g_ctx = context::in_list_after_value;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001351 break;
1352 case context::in_dict_after_brace:
Nigel Taod60815c2020-03-26 14:32:35 +11001353 g_ctx = context::in_dict_after_key;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001354 break;
1355 case context::in_dict_after_key:
Nigel Taod60815c2020-03-26 14:32:35 +11001356 g_ctx = context::in_dict_after_value;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001357 break;
1358 case context::in_dict_after_value:
Nigel Taod60815c2020-03-26 14:32:35 +11001359 g_ctx = context::in_dict_after_key;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001360 break;
Nigel Tao18ef5b42020-03-16 10:37:47 +11001361 default:
1362 break;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001363 }
1364 return nullptr;
1365}
1366
1367const char* //
1368main1(int argc, char** argv) {
1369 TRY(initialize_globals(argc, argv));
1370
Nigel Taocd183f92020-07-14 12:11:05 +10001371 bool start_of_token_chain = true;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001372 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +11001373 wuffs_base__status status = g_dec.decode_tokens(
1374 &g_tok, &g_src,
1375 wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));
Nigel Tao2cf76db2020-02-27 22:42:01 +11001376
Nigel Taod60815c2020-03-26 14:32:35 +11001377 while (g_tok.meta.ri < g_tok.meta.wi) {
1378 wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
Nigel Tao2cf76db2020-02-27 22:42:01 +11001379 uint64_t n = t.length();
Nigel Taod60815c2020-03-26 14:32:35 +11001380 if ((g_src.meta.ri - g_curr_token_end_src_index) < n) {
1381 return "main: internal error: inconsistent g_src indexes";
Nigel Tao2cf76db2020-02-27 22:42:01 +11001382 }
Nigel Taod60815c2020-03-26 14:32:35 +11001383 g_curr_token_end_src_index += n;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001384
Nigel Taod0b16cb2020-03-14 10:15:54 +11001385 // Skip filler tokens (e.g. whitespace).
Nigel Tao2cf76db2020-02-27 22:42:01 +11001386 if (t.value() == 0) {
Nigel Tao496e88b2020-04-09 22:10:08 +10001387 start_of_token_chain = !t.continued();
Nigel Tao2cf76db2020-02-27 22:42:01 +11001388 continue;
1389 }
1390
Nigel Tao2ef39992020-04-09 17:24:39 +10001391 const char* z = handle_token(t, start_of_token_chain);
Nigel Tao496e88b2020-04-09 22:10:08 +10001392 start_of_token_chain = !t.continued();
Nigel Tao2cf76db2020-02-27 22:42:01 +11001393 if (z == nullptr) {
1394 continue;
Nigel Taod60815c2020-03-26 14:32:35 +11001395 } else if (z == g_eod) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001396 goto end_of_data;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001397 }
1398 return z;
Nigel Tao1b073492020-02-16 22:11:36 +11001399 }
Nigel Tao2cf76db2020-02-27 22:42:01 +11001400
1401 if (status.repr == nullptr) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001402 return "main: internal error: unexpected end of token stream";
Nigel Tao2cf76db2020-02-27 22:42:01 +11001403 } else if (status.repr == wuffs_base__suspension__short_read) {
Nigel Taod60815c2020-03-26 14:32:35 +11001404 if (g_curr_token_end_src_index != g_src.meta.ri) {
1405 return "main: internal error: inconsistent g_src indexes";
Nigel Tao2cf76db2020-02-27 22:42:01 +11001406 }
1407 TRY(read_src());
Nigel Taod60815c2020-03-26 14:32:35 +11001408 g_curr_token_end_src_index = g_src.meta.ri;
Nigel Tao2cf76db2020-02-27 22:42:01 +11001409 } else if (status.repr == wuffs_base__suspension__short_write) {
Nigel Taod60815c2020-03-26 14:32:35 +11001410 g_tok.compact();
Nigel Tao2cf76db2020-02-27 22:42:01 +11001411 } else {
1412 return status.message();
Nigel Tao1b073492020-02-16 22:11:36 +11001413 }
1414 }
Nigel Tao0cd2f982020-03-03 23:03:02 +11001415end_of_data:
1416
Nigel Taod60815c2020-03-26 14:32:35 +11001417 // With a non-empty g_query, don't try to consume trailing whitespace or
Nigel Tao0cd2f982020-03-03 23:03:02 +11001418 // confirm that we've processed all the tokens.
Nigel Taod60815c2020-03-26 14:32:35 +11001419 if (g_flags.query_c_string && *g_flags.query_c_string) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001420 return nullptr;
1421 }
Nigel Tao6b161af2020-02-24 11:01:48 +11001422
Nigel Tao6b161af2020-02-24 11:01:48 +11001423 // Check that we've exhausted the input.
Nigel Taod60815c2020-03-26 14:32:35 +11001424 if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
Nigel Taofe0cbbd2020-03-05 22:01:30 +11001425 TRY(read_src());
1426 }
Nigel Taod60815c2020-03-26 14:32:35 +11001427 if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
Nigel Tao6b161af2020-02-24 11:01:48 +11001428 return "main: valid JSON followed by further (unexpected) data";
1429 }
1430
1431 // Check that we've used all of the decoded tokens, other than trailing
Nigel Tao4b186b02020-03-18 14:25:21 +11001432 // filler tokens. For example, "true\n" is valid JSON (and fully consumed
1433 // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
1434 // filler token for the "\n".
Nigel Taod60815c2020-03-26 14:32:35 +11001435 for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
1436 if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
Nigel Tao6b161af2020-02-24 11:01:48 +11001437 WUFFS_BASE__TOKEN__VBC__FILLER) {
1438 return "main: internal error: decoded OK but unprocessed tokens remain";
1439 }
1440 }
1441
1442 return nullptr;
Nigel Tao1b073492020-02-16 22:11:36 +11001443}
1444
Nigel Tao2914bae2020-02-26 09:40:30 +11001445int //
1446compute_exit_code(const char* status_msg) {
Nigel Tao9cc2c252020-02-23 17:05:49 +11001447 if (!status_msg) {
1448 return 0;
1449 }
Nigel Tao01abc842020-03-06 21:42:33 +11001450 size_t n;
Nigel Taod60815c2020-03-26 14:32:35 +11001451 if (status_msg == g_usage) {
Nigel Tao01abc842020-03-06 21:42:33 +11001452 n = strlen(status_msg);
1453 } else {
Nigel Tao9cc2c252020-02-23 17:05:49 +11001454 n = strnlen(status_msg, 2047);
Nigel Tao01abc842020-03-06 21:42:33 +11001455 if (n >= 2047) {
1456 status_msg = "main: internal error: error message is too long";
1457 n = strnlen(status_msg, 2047);
1458 }
Nigel Tao9cc2c252020-02-23 17:05:49 +11001459 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +11001460 const int stderr_fd = 2;
1461 ignore_return_value(write(stderr_fd, status_msg, n));
1462 ignore_return_value(write(stderr_fd, "\n", 1));
Nigel Tao9cc2c252020-02-23 17:05:49 +11001463 // Return an exit code of 1 for regular (forseen) errors, e.g. badly
1464 // formatted or unsupported input.
1465 //
1466 // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
1467 // run-time checks found that an internal invariant did not hold.
1468 //
1469 // Automated testing, including badly formatted inputs, can therefore
1470 // discriminate between expected failure (exit code 1) and unexpected failure
1471 // (other non-zero exit codes). Specifically, exit code 2 for internal
1472 // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
1473 // linux) for a segmentation fault (e.g. null pointer dereference).
1474 return strstr(status_msg, "internal error:") ? 2 : 1;
1475}
1476
Nigel Tao2914bae2020-02-26 09:40:30 +11001477int //
1478main(int argc, char** argv) {
Nigel Tao01abc842020-03-06 21:42:33 +11001479 // Look for an input filename (the first non-flag argument) in argv. If there
1480 // is one, open it (but do not read from it) before we self-impose a sandbox.
1481 //
1482 // Flags start with "-", unless it comes after a bare "--" arg.
1483 {
1484 bool dash_dash = false;
1485 int a;
1486 for (a = 1; a < argc; a++) {
1487 char* arg = argv[a];
1488 if ((arg[0] == '-') && !dash_dash) {
1489 dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
1490 continue;
1491 }
Nigel Taod60815c2020-03-26 14:32:35 +11001492 g_input_file_descriptor = open(arg, O_RDONLY);
1493 if (g_input_file_descriptor < 0) {
Nigel Tao01abc842020-03-06 21:42:33 +11001494 fprintf(stderr, "%s: %s\n", arg, strerror(errno));
1495 return 1;
1496 }
1497 break;
1498 }
1499 }
1500
Nigel Taofe0cbbd2020-03-05 22:01:30 +11001501#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
1502 prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
Nigel Taod60815c2020-03-26 14:32:35 +11001503 g_sandboxed = true;
Nigel Taofe0cbbd2020-03-05 22:01:30 +11001504#endif
1505
Nigel Tao0cd2f982020-03-03 23:03:02 +11001506 const char* z = main1(argc, argv);
Nigel Taod60815c2020-03-26 14:32:35 +11001507 if (g_wrote_to_dst) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001508 const char* z1 = (g_flags.output_format == file_format::json)
1509 ? write_dst("\n", 1)
1510 : nullptr;
Nigel Tao0cd2f982020-03-03 23:03:02 +11001511 const char* z2 = flush_dst();
1512 z = z ? z : (z1 ? z1 : z2);
1513 }
1514 int exit_code = compute_exit_code(z);
Nigel Taofe0cbbd2020-03-05 22:01:30 +11001515
1516#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
1517 // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
1518 // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
1519 // only SYS_exit.
1520 syscall(SYS_exit, exit_code);
1521#endif
Nigel Tao9cc2c252020-02-23 17:05:49 +11001522 return exit_code;
Nigel Tao1b073492020-02-16 22:11:36 +11001523}