// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin and writes CBOR
or canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and fewer-featured compared to those others.

One benefit of simplicity is that this program's CBOR, JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet they do not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.

The CBOR and JSON implementations are also written in the Wuffs programming
language (and then transpiled to C/C++), which is memory-safe (e.g. array
indexing is bounds-checked) and also prevents integer arithmetic overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.

Altogether, this program aims to safely handle untrusted CBOR or JSON files
without fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing
each token of the decoder's output stream individually. The core loop, in
pseudo-code, is "for_each_token { handle_token(etc); }", where the handle_token
function changes global state (e.g. the `g_depth` and `g_ctx` variables) and
prints output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.
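
As a rough sketch of that loop shape (this is not the program's actual code:
the Kind, Token and State types and the handle_token signature below are
hypothetical stand-ins for Wuffs' token API, the tokenizer is assumed to label
object keys for us, and the input is assumed to be well-formed), a compact
formatter can track nesting with a depth counter and a small context enum
instead of recursing:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token kinds, for illustration only. Wuffs' real tokens are
// lower level: for example, one JSON string can span several tokens.
enum class Kind { open_list, close_list, open_dict, close_dict, key, scalar };

struct Token {
  Kind kind;
  std::string text;  // Source text. Only used for key and scalar tokens.
};

// Context mirrors the idea behind g_ctx: what did we most recently emit?
enum class Context { none, after_open, after_key, after_value };

struct State {
  uint32_t depth = 0;
  uint32_t peak_depth = 0;
  Context ctx = Context::none;
  std::string out;
};

// handle_token updates the state and appends output for a single token. It
// never calls itself: nesting is tracked by State::depth, not the call stack.
void handle_token(State& s, const Token& t) {
  if ((t.kind == Kind::close_list) || (t.kind == Kind::close_dict)) {
    s.depth--;
    s.out += (t.kind == Kind::close_list) ? "]" : "}";
    s.ctx = Context::after_value;
    return;
  }

  // Every other token starts a new element, so it may need a separator.
  if (s.ctx == Context::after_value) {
    s.out += ",";
  }

  switch (t.kind) {
    case Kind::open_list:
    case Kind::open_dict:
      s.out += (t.kind == Kind::open_list) ? "[" : "{";
      s.depth++;
      if (s.peak_depth < s.depth) {
        s.peak_depth = s.depth;
      }
      s.ctx = Context::after_open;
      break;
    case Kind::key:
      s.out += t.text + ":";
      s.ctx = Context::after_key;
      break;
    default:  // Kind::scalar.
      s.out += t.text;
      s.ctx = Context::after_value;
      break;
  }
}

// format_compact is the "for_each_token { handle_token(etc); }" core loop.
State format_compact(const std::vector<Token>& tokens) {
  State s;
  for (const Token& t : tokens) {
    handle_token(s, t);
  }
  return s;
}
```

Feeding the seven tokens of {"a": [1, 2]} through format_compact should leave
State::out holding the compact form of that object, with State::peak_depth
recording the deepest nesting seen, all without any recursive calls.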

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.
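
To make that trade-off concrete, here is a minimal sketch (not code from either
program: both class names are hypothetical and, unlike jsonptr, they use
std::string and so allocate dynamically). Rejecting duplicates has to remember
every key seen so far in an object, whereas a greedy first-match lookup only
needs the wanted key and a single flag:

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// DuplicateKeyDetector rejects duplicate keys. Its memory use grows with the
// number of distinct keys fed to it, which an attacker may control.
class DuplicateKeyDetector {
 public:
  // add returns false if the key was already seen in this object.
  bool add(const std::string& key) { return m_seen.insert(key).second; }

 private:
  std::unordered_set<std::string> m_seen;
};

// FirstMatchFinder allows duplicate keys and greedily takes the first match.
// Its memory use is independent of how many keys are fed to it.
class FirstMatchFinder {
 public:
  explicit FirstMatchFinder(std::string wanted) : m_wanted(std::move(wanted)) {}

  // add returns true only for the first key that equals the wanted key.
  bool add(const std::string& key) {
    if (m_found || (key != m_wanted)) {
      return false;
    }
    m_found = true;
    return true;
  }

 private:
  std::string m_wanted;
  bool m_found = false;
};
```

Given the keys "a", "b", "a" in that order, the detector would reject the third
key, while a finder looking for "a" would accept the first key and silently
ignore the third.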

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C.

$CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#if defined(__cplusplus) && (__cplusplus < 201103L)
#error "This C++ program requires -std=c++11 or later"
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but it lets users of
// release/c/etc.c whitelist which parts of Wuffs to build. That file contains
// the entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__CBOR
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -d=NUM  -max-output-depth=NUM\n"
    "    -i=FMT  -input-format={json,cbor}\n"
    "    -o=FMT  -output-format={json,cbor}\n"
    "    -q=STR  -query=STR\n"
    "    -s=NUM  -spaces=NUM\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "            -input-json-extra-comma\n"
    "            -output-json-extra-comma\n"
    "            -strict-json-pointer-syntax\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin\n"
    "and writes CBOR or canonicalized, formatted UTF-8 JSON to stdout. The\n"
    "input and output formats do not have to match, but conversion between\n"
    "formats may be lossy.\n"
    "\n"
    "Canonicalized means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-written\n"
    "as \"abc\\n\\txŷz\". It does not sort object keys, nor does it reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
    "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags. Those\n"
    "flags only apply to JSON (not CBOR) output.\n"
    "\n"
    "The -input-format and -output-format flags select between reading and\n"
    "writing JSON (the default, a textual format) or CBOR (a binary format).\n"
    "\n"
    "The -input-json-extra-comma flag allows input like \"[1,2,]\", with a\n"
    "comma after the final element of a JSON list or dictionary.\n"
    "\n"
    "The -output-json-extra-comma flag writes extra commas, regardless of\n"
    "whether the input had them. Extra commas are non-compliant with the JSON\n"
    "specification but many parsers accept them and they can produce simpler\n"
    "line-based diffs. This flag is ignored when -compact-output is set.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON|CBOR value, this program will return a\n"
    "zero exit code even if the rest of the input isn't valid. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON and CBOR specifications (https://json.org/ or RFC 8259; RFC\n"
    "7049) permit implementations to allow duplicate keys, as this one does.\n"
    "This JSON Pointer implementation is also greedy, following the first\n"
    "match for each fragment without back-tracking. For example, the\n"
    "\"/foo/bar\" query will fail if the root object has multiple \"foo\"\n"
    "children but the first one doesn't have a \"bar\" child, even if later\n"
    "ones do.\n"
    "\n"
    "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
    "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
    "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
    "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
    "which can work better with line oriented Unix tools that assume exactly\n"
    "one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON|CBOR containers ([] arrays and {} objects) can hold\n"
    "other containers. When this flag is set, containers at depth NUM are\n"
    "replaced with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is\n"
    "equivalent to -d=1. The flag's absence means an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON|CBOR. The\n"
    "format specifications permit implementations to set their own maximum\n"
    "input depth. This JSON|CBOR implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_curr_token_end_src_index is the g_src.data.ptr index of the end of the
// current token. An invariant is that (g_curr_token_end_src_index <=
// g_src.meta.ri).
size_t g_curr_token_end_src_index;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_cbor__decoder g_cbor_decoder;
wuffs_json__decoder g_json_decoder;
wuffs_base__token_decoder* g_dec;

// g_cbor_output_string_array is a 4 KiB buffer. For -output-format=cbor,
// strings whose length is 4096 or less are written as a single definite-length
// string. Longer strings are written as an indefinite-length string containing
// multiple definite-length chunks, each of length up to 4 KiB. See the CBOR
// RFC (RFC 7049) section 2.2.2 "Indefinite-Length Byte Strings and Text
// Strings". The output is determinate even when the input is streamed.
//
// If raising CBOR_OUTPUT_STRING_ARRAY_SIZE above 0xFFFF then you will also
// have to update flush_cbor_output_string.
#define CBOR_OUTPUT_STRING_ARRAY_SIZE 4096
uint8_t g_cbor_output_string_array[CBOR_OUTPUT_STRING_ARRAY_SIZE];

uint32_t g_cbor_output_string_length;
bool g_cbor_output_string_is_multiple_chunks;
bool g_cbor_output_string_is_utf_8;
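
/*
As a hedged illustration of the chunking rule described above (a sketch only,
not this program's flush_cbor_output_string: the append_definite and
encode_cbor_bytes names and the max_chunk parameter are assumptions), the
following encodes a CBOR byte string (major type 2) either as one
definite-length item or as an indefinite-length sequence of definite-length
chunks. Text strings (major type 3) follow the same pattern, but RFC 7049
additionally requires that no UTF-8 character be split across chunks, which
this sketch does not attempt.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// append_definite appends one definite-length CBOR byte string. It assumes
// that n <= 0xFFFF, so that the length fits in at most two bytes.
static void append_definite(std::vector<uint8_t>& out,
                            const uint8_t* ptr,
                            size_t n) {
  if (n < 24) {
    out.push_back(static_cast<uint8_t>(0x40 | n));
  } else if (n <= 0xFF) {
    out.push_back(0x58);
    out.push_back(static_cast<uint8_t>(n));
  } else {
    out.push_back(0x59);
    out.push_back(static_cast<uint8_t>(n >> 8));
    out.push_back(static_cast<uint8_t>(n & 0xFF));
  }
  out.insert(out.end(), ptr, ptr + n);
}

// encode_cbor_bytes writes src as a single definite-length byte string if it
// fits in max_chunk bytes, or otherwise as an indefinite-length byte string
// whose chunks are each at most max_chunk bytes long.
std::vector<uint8_t> encode_cbor_bytes(const std::vector<uint8_t>& src,
                                       size_t max_chunk) {
  if ((max_chunk == 0) || (max_chunk > 0xFFFF)) {
    max_chunk = 0xFFFF;  // Keep every chunk length within two bytes.
  }
  std::vector<uint8_t> out;
  if (src.size() <= max_chunk) {
    append_definite(out, src.data(), src.size());
    return out;
  }
  out.push_back(0x5F);  // Start an indefinite-length byte string.
  for (size_t i = 0; i < src.size(); i += max_chunk) {
    size_t n = src.size() - i;
    if (n > max_chunk) {
      n = max_chunk;
    }
    append_definite(out, src.data() + i, n);
  }
  out.push_back(0xFF);  // The "break" stop code.
  return out;
}
```

With max_chunk set to 4, the 3 bytes {1, 2, 3} encode as 43 01 02 03, while the
5 bytes {1, 2, 3, 4, 5} encode as 5F 44 01 02 03 04 41 05 FF. The chunk
boundaries depend only on the content and on max_chunk, not on how the input
happened to arrive.
*/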

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                       mfj
//                       v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match the array element at index 12 (its 13th element, as indexes
// start at zero). See RFC 6901 for its precise definition of an "array index".
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8(i, k - i),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }

  void incremental_match_slice(uint8_t* ptr, size_t len) {
    if (!m_frag_j) {
      return;
    }
    uint8_t* j = m_frag_j;
    while (true) {
      if (len == 0) {
        m_frag_j = j;
        return;
      }

      if (*j == '\x00') {
        break;

      } else if (*j == '~') {
        j++;
        if (*j == '0') {
          if (*ptr != '~') {
            break;
          }
        } else if (*j == '1') {
          if (*ptr != '/') {
            break;
          }
        } else if (*j == 'n') {
          if (*ptr != '\n') {
            break;
          }
        } else if (*j == 'r') {
          if (*ptr != '\r') {
            break;
          }
        } else {
          break;
        }

      } else if (*j != *ptr) {
        break;
      }

      j++;
      ptr++;
      len--;
    }
    m_frag_j = nullptr;
  }

  void incremental_match_code_point(uint32_t code_point) {
    if (!m_frag_j) {
      return;
    }
    uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
    size_t n = wuffs_base__utf_8__encode(
        wuffs_base__make_slice_u8(&u[0],
                                  WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
        code_point);
    if (n > 0) {
      this->incremental_match_slice(&u[0], n);
    }
  }

  // validate returns whether the (query_c_string, length) arguments form a
  // valid JSON Pointer. In particular, it must be valid UTF-8, and either be
  // empty or start with a '/'. Any '~' within must immediately be followed by
  // either '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also
  // be followed by either 'n' or 'r'.
  static bool validate(char* query_c_string,
                       size_t length,
                       bool strict_json_pointer_syntax) {
    if (length <= 0) {
      return true;
    }
    if (query_c_string[0] != '/') {
      return false;
    }
    wuffs_base__slice_u8 s =
        wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
    bool previous_was_tilde = false;
    while (s.len > 0) {
      wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s);
      if (!o.is_valid()) {
        return false;
      }

      if (previous_was_tilde) {
        switch (o.code_point) {
          case '0':
          case '1':
            break;
          case 'n':
          case 'r':
            if (strict_json_pointer_syntax) {
              return false;
            }
            break;
          default:
            return false;
        }
      }
      previous_was_tilde = o.code_point == '~';

      s.ptr += o.byte_length;
      s.len -= o.byte_length;
    }
    return !previous_was_tilde;
  }
} g_query;
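
/*
The incremental matching that the Query class comment describes can be tried
out in isolation with a simplified sketch. This is not the Query class itself:
FragmentMatcher is a hypothetical name, it holds a single fragment as a
std::string (so it allocates, unlike this program), it uses an offset and an
"alive" flag where Query uses the mfj pointer and nullptr, and it only handles
the strict "~0" and "~1" escapes.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class FragmentMatcher {
 public:
  explicit FragmentMatcher(std::string fragment)
      : m_frag(std::move(fragment)) {}

  // restart is called at the start of each object key.
  void restart() {
    m_j = 0;
    m_alive = true;
  }

  // feed consumes one token's worth of already-unescaped key bytes. The
  // matcher stays alive only while the key so far is a prefix of the fragment.
  void feed(const std::string& chunk) {
    if (!m_alive) {
      return;
    }
    for (char c : chunk) {
      if (m_j >= m_frag.size()) {
        m_alive = false;  // The key is longer than the fragment.
        return;
      }
      char want = m_frag[m_j++];
      if (want == '~') {
        // JSON Pointer escapes: "~0" means '~' and "~1" means '/'.
        if (m_j >= m_frag.size()) {
          m_alive = false;
          return;
        }
        char e = m_frag[m_j++];
        if (e == '0') {
          want = '~';
        } else if (e == '1') {
          want = '/';
        } else {
          m_alive = false;
          return;
        }
      }
      if (want != c) {
        m_alive = false;
        return;
      }
    }
  }

  // matched is called at the end of each object key.
  bool matched() const { return m_alive && (m_j == m_frag.size()); }

 private:
  std::string m_frag;
  size_t m_j = 0;
  bool m_alive = true;
};
```

For a "banana" fragment, feeding "ban", "a" and then "na" ends in a match,
whereas feeding "ban" and then "x" drops out, and feeding only "banan" leaves a
strict prefix that does not count as a match. For an "a~1b" fragment, feeding
"a/" and then "b" matches.
*/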

// ----

enum class file_format {
  json,
  cbor,
};

struct {
  int remaining_argc;
  char** remaining_argv;

  bool compact_output;
  bool fail_if_unsandboxed;
  file_format input_format;
  bool input_json_extra_comma;
  uint32_t max_output_depth;
  file_format output_format;
  bool output_json_extra_comma;
  char* query_c_string;
  size_t spaces;
  bool strict_json_pointer_syntax;
  bool tabs;
} g_flags = {0};

const char*  //
parse_flags(int argc, char** argv) {
  g_flags.spaces = 4;
  g_flags.max_output_depth = 0xFFFFFFFF;

  int c = (argc > 0) ? 1 : 0;  // Skip argv[0], the program name.
  for (; c < argc; c++) {
    char* arg = argv[c];
    if (*arg++ != '-') {
      break;
    }

    // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
    // cases, a bare "-" is not a flag (some programs may interpret it as
    // stdin) and a bare "--" means to stop parsing flags.
    if (*arg == '\x00') {
      break;
    } else if (*arg == '-') {
      arg++;
      if (*arg == '\x00') {
        c++;
        break;
      }
    }

    if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
      g_flags.compact_output = true;
      continue;
    }
    if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
      g_flags.max_output_depth = 1;
      continue;
    } else if (!strncmp(arg, "d=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
      // The compared length must include the '=' (17 bytes for
      // "max-output-depth="), so that the scan below is guaranteed to find it
      // before the terminating NUL.
      while (*arg++ != '=') {
      }
      wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
      if (wuffs_base__status__is_ok(&u.status) && (u.value <= 0xFFFFFFFF)) {
        g_flags.max_output_depth = (uint32_t)(u.value);
        continue;
      }
      return g_usage;
    }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100660 if (!strcmp(arg, "fail-if-unsandboxed")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100661 g_flags.fail_if_unsandboxed = true;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100662 continue;
663 }
Nigel Tao4e193592020-07-15 12:48:57 +1000664 if (!strcmp(arg, "i=cbor") || !strcmp(arg, "input-format=cbor")) {
665 g_flags.input_format = file_format::cbor;
666 continue;
667 }
668 if (!strcmp(arg, "i=json") || !strcmp(arg, "input-format=json")) {
669 g_flags.input_format = file_format::json;
670 continue;
671 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000672 if (!strcmp(arg, "input-json-extra-comma")) {
673 g_flags.input_json_extra_comma = true;
674 continue;
675 }
Nigel Tao168f60a2020-07-14 13:19:33 +1000676 if (!strcmp(arg, "o=cbor") || !strcmp(arg, "output-format=cbor")) {
677 g_flags.output_format = file_format::cbor;
678 continue;
679 }
680 if (!strcmp(arg, "o=json") || !strcmp(arg, "output-format=json")) {
681 g_flags.output_format = file_format::json;
682 continue;
683 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000684 if (!strcmp(arg, "output-json-extra-comma")) {
685 g_flags.output_json_extra_comma = true;
686 continue;
687 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100688 if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
689 while (*arg++ != '=') {
690 }
Nigel Taod60815c2020-03-26 14:32:35 +1100691 g_flags.query_c_string = arg;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100692 continue;
693 }
Nigel Taoecadf722020-07-13 08:22:34 +1000694 if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
695 while (*arg++ != '=') {
696 }
697 if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
698 g_flags.spaces = arg[0] - '0';
699 continue;
700 }
701 return g_usage;
702 }
703 if (!strcmp(arg, "strict-json-pointer-syntax")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100704 g_flags.strict_json_pointer_syntax = true;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100705 continue;
Nigel Tao68920952020-03-03 11:25:18 +1100706 }
707 if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100708 g_flags.tabs = true;
Nigel Tao68920952020-03-03 11:25:18 +1100709 continue;
710 }
711
Nigel Taod60815c2020-03-26 14:32:35 +1100712 return g_usage;
Nigel Tao68920952020-03-03 11:25:18 +1100713 }
714
Nigel Taod60815c2020-03-26 14:32:35 +1100715 if (g_flags.query_c_string &&
716 !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
717 g_flags.strict_json_pointer_syntax)) {
Nigel Taod6fdfb12020-03-11 12:24:14 +1100718 return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
719 }
720
Nigel Taod60815c2020-03-26 14:32:35 +1100721 g_flags.remaining_argc = argc - c;
722 g_flags.remaining_argv = argv + c;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100723 return nullptr;
Nigel Tao68920952020-03-03 11:25:18 +1100724}
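
// The dash handling and "key=value" matching in parse_flags can be sketched
// as two small helpers. Both names (sketch_flag_name, sketch_flag_value) are
// hypothetical and not part of jsonptr. Unlike parse_flags, this sketch does
// not distinguish a bare "-" from a bare "--"; it reports both as "not a
// flag".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns the flag name with its leading "-" or "--" removed, or nullptr if
// arg is not a flag (no leading dash, a bare "-" or a bare "--").
const char* sketch_flag_name(const char* arg) {
  if (*arg++ != '-') {
    return nullptr;
  }
  if (*arg == '\x00') {
    return nullptr;  // Bare "-".
  }
  if (*arg == '-') {
    arg++;
    if (*arg == '\x00') {
      return nullptr;  // Bare "--".
    }
  }
  return arg;
}

// Returns the text after prefix if name starts with prefix, or nullptr
// otherwise. Deriving the length with strlen keeps it in sync with the
// prefix string, instead of relying on a hand-counted constant.
const char* sketch_flag_value(const char* name, const char* prefix) {
  std::size_t n = std::strlen(prefix);
  return (std::strncmp(name, prefix, n) == 0) ? (name + n) : nullptr;
}
```

// With these, "--max-output-depth=3" and "-max-output-depth=3" both yield
// the value "3", and a name that lacks the '=' is rejected up front.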
725
Nigel Tao2cf76db2020-02-27 22:42:01 +1100726const char* //
727initialize_globals(int argc, char** argv) {
Nigel Taod60815c2020-03-26 14:32:35 +1100728 g_dst = wuffs_base__make_io_buffer(
729 wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100730 wuffs_base__empty_io_buffer_meta());
Nigel Tao1b073492020-02-16 22:11:36 +1100731
Nigel Taod60815c2020-03-26 14:32:35 +1100732 g_src = wuffs_base__make_io_buffer(
733 wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100734 wuffs_base__empty_io_buffer_meta());
735
Nigel Taod60815c2020-03-26 14:32:35 +1100736 g_tok = wuffs_base__make_token_buffer(
737 wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100738 wuffs_base__empty_token_buffer_meta());
739
Nigel Taod60815c2020-03-26 14:32:35 +1100740 g_curr_token_end_src_index = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100741
Nigel Taod60815c2020-03-26 14:32:35 +1100742 g_depth = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100743
Nigel Taod60815c2020-03-26 14:32:35 +1100744 g_ctx = context::none;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100745
Nigel Tao68920952020-03-03 11:25:18 +1100746 TRY(parse_flags(argc, argv));
Nigel Taod60815c2020-03-26 14:32:35 +1100747 if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100748 return "main: unsandboxed";
749 }
Nigel Tao01abc842020-03-06 21:42:33 +1100750 const int stdin_fd = 0;
Nigel Taod60815c2020-03-26 14:32:35 +1100751 if (g_flags.remaining_argc >
752 ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
753 return g_usage;
Nigel Tao107f0ef2020-03-01 21:35:02 +1100754 }
755
Nigel Taod60815c2020-03-26 14:32:35 +1100756 g_query.reset(g_flags.query_c_string);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100757
  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
  g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
  g_wrote_to_dst = false;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100762
Nigel Tao4e193592020-07-15 12:48:57 +1000763 if (g_flags.input_format == file_format::json) {
764 TRY(g_json_decoder
765 .initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
766 .message());
767 g_dec = g_json_decoder.upcast_as__wuffs_base__token_decoder();
768 } else {
769 TRY(g_cbor_decoder
770 .initialize(sizeof__wuffs_cbor__decoder(), WUFFS_VERSION, 0)
771 .message());
772 g_dec = g_cbor_decoder.upcast_as__wuffs_base__token_decoder();
773 }
Nigel Tao4b186b02020-03-18 14:25:21 +1100774
Nigel Taoc766bb72020-07-09 12:59:32 +1000775 if (g_flags.input_json_extra_comma) {
Nigel Tao4e193592020-07-15 12:48:57 +1000776 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
Nigel Taoc766bb72020-07-09 12:59:32 +1000777 }
778
Nigel Tao4b186b02020-03-18 14:25:21 +1100779 // Consume an optional whitespace trailer. This isn't part of the JSON spec,
780 // but it works better with line oriented Unix tools (such as "echo 123 |
781 // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
782 // can accidentally contain trailing whitespace.
Nigel Tao4e193592020-07-15 12:48:57 +1000783 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);
Nigel Tao4b186b02020-03-18 14:25:21 +1100784
785 return nullptr;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100786}
Nigel Tao1b073492020-02-16 22:11:36 +1100787
788// ----
789
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100790// ignore_return_value suppresses errors from -Wall -Werror.
791static void //
792ignore_return_value(int ignored) {}
793
Nigel Tao2914bae2020-02-26 09:40:30 +1100794const char* //
795read_src() {
Nigel Taod60815c2020-03-26 14:32:35 +1100796 if (g_src.meta.closed) {
Nigel Tao9cc2c252020-02-23 17:05:49 +1100797 return "main: internal error: read requested on a closed source";
Nigel Taoa8406922020-02-19 12:22:00 +1100798 }
Nigel Taod60815c2020-03-26 14:32:35 +1100799 g_src.compact();
800 if (g_src.meta.wi >= g_src.data.len) {
801 return "main: g_src buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100802 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100803 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100804 ssize_t n = read(g_input_file_descriptor, g_src.data.ptr + g_src.meta.wi,
805 g_src.data.len - g_src.meta.wi);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100806 if (n >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100807 g_src.meta.wi += n;
808 g_src.meta.closed = n == 0;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100809 break;
810 } else if (errno != EINTR) {
811 return strerror(errno);
812 }
Nigel Tao1b073492020-02-16 22:11:36 +1100813 }
814 return nullptr;
815}
816
Nigel Tao2914bae2020-02-26 09:40:30 +1100817const char* //
818flush_dst() {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100819 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100820 size_t n = g_dst.meta.wi - g_dst.meta.ri;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100821 if (n == 0) {
822 break;
Nigel Tao1b073492020-02-16 22:11:36 +1100823 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100824 const int stdout_fd = 1;
Nigel Taod60815c2020-03-26 14:32:35 +1100825 ssize_t i = write(stdout_fd, g_dst.data.ptr + g_dst.meta.ri, n);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100826 if (i >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100827 g_dst.meta.ri += i;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100828 } else if (errno != EINTR) {
829 return strerror(errno);
830 }
Nigel Tao1b073492020-02-16 22:11:36 +1100831 }
Nigel Taod60815c2020-03-26 14:32:35 +1100832 g_dst.compact();
Nigel Tao1b073492020-02-16 22:11:36 +1100833 return nullptr;
834}
835
Nigel Tao2914bae2020-02-26 09:40:30 +1100836const char* //
837write_dst(const void* s, size_t n) {
Nigel Taod60815c2020-03-26 14:32:35 +1100838 if (g_suppress_write_dst > 0) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100839 return nullptr;
840 }
Nigel Tao1b073492020-02-16 22:11:36 +1100841 const uint8_t* p = static_cast<const uint8_t*>(s);
842 while (n > 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100843 size_t i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100844 if (i == 0) {
845 const char* z = flush_dst();
846 if (z) {
847 return z;
848 }
Nigel Taod60815c2020-03-26 14:32:35 +1100849 i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100850 if (i == 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100851 return "main: g_dst buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100852 }
853 }
854
855 if (i > n) {
856 i = n;
857 }
Nigel Taod60815c2020-03-26 14:32:35 +1100858 memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
859 g_dst.meta.wi += i;
Nigel Tao1b073492020-02-16 22:11:36 +1100860 p += i;
861 n -= i;
Nigel Taod60815c2020-03-26 14:32:35 +1100862 g_wrote_to_dst = true;
Nigel Tao1b073492020-02-16 22:11:36 +1100863 }
864 return nullptr;
865}
866
867// ----
868
Nigel Tao168f60a2020-07-14 13:19:33 +1000869const char* //
870write_literal(uint64_t vbd) {
871 const char* ptr = nullptr;
872 size_t len = 0;
873 if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__UNDEFINED) {
874 if (g_flags.output_format == file_format::json) {
875 ptr = "null"; // JSON's closest approximation to "undefined".
876 len = 4;
877 } else {
878 ptr = "\xF7";
879 len = 1;
880 }
881 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__NULL) {
882 if (g_flags.output_format == file_format::json) {
883 ptr = "null";
884 len = 4;
885 } else {
886 ptr = "\xF6";
887 len = 1;
888 }
889 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__FALSE) {
890 if (g_flags.output_format == file_format::json) {
891 ptr = "false";
892 len = 5;
893 } else {
894 ptr = "\xF4";
895 len = 1;
896 }
897 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__TRUE) {
898 if (g_flags.output_format == file_format::json) {
899 ptr = "true";
900 len = 4;
901 } else {
902 ptr = "\xF5";
903 len = 1;
904 }
905 } else {
906 return "main: internal error: unexpected write_literal argument";
907 }
908 return write_dst(ptr, len);
909}
910
911// ----
912
// write_number_cbor_f64 writes f as a CBOR floating point value, using the
// shortest of the 16, 32 and 64 bit IEEE 754 encodings (leading bytes 0xF9,
// 0xFA and 0xFB) that represents f without loss.
const char*  //
write_number_cbor_f64(double f) {
  uint8_t buf[9];
  wuffs_base__lossy_value_u16 lv16 =
      wuffs_base__ieee_754_bit_representation__from_f64_to_u16_truncate(f);
  if (!lv16.lossy) {
    buf[0] = 0xF9;
    wuffs_base__store_u16be__no_bounds_check(&buf[1], lv16.value);
    return write_dst(&buf[0], 3);
  }
  wuffs_base__lossy_value_u32 lv32 =
      wuffs_base__ieee_754_bit_representation__from_f64_to_u32_truncate(f);
  if (!lv32.lossy) {
    buf[0] = 0xFA;
    wuffs_base__store_u32be__no_bounds_check(&buf[1], lv32.value);
    return write_dst(&buf[0], 5);
  }
  buf[0] = 0xFB;
  wuffs_base__store_u64be__no_bounds_check(
      &buf[1], wuffs_base__ieee_754_bit_representation__from_f64_to_u64(f));
  return write_dst(&buf[0], 9);
}
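
// The "narrowest lossless width" idea can be sketched without Wuffs. This is
// a simplified, hypothetical helper (sketch_cbor_float is not part of
// jsonptr). It swaps in a different technique: a round trip through float
// instead of the Wuffs truncation helpers. It only chooses between 32 and 64
// bits (no 16 bit case), always writes NaN as 64 bits, and assumes that float
// and double are IEEE 754 binary32 and binary64.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: encodes f as CBOR, preferring the 32 bit form (0xFA)
// when f survives a round trip through float, else the 64 bit form (0xFB).
std::vector<uint8_t> sketch_cbor_float(double f) {
  std::vector<uint8_t> out;
  // Only convert values that are known to be in float's range.
  bool in_f32_range = std::isinf(f) || (std::fabs(f) <= FLT_MAX);
  if (!std::isnan(f) && in_f32_range) {
    float g = static_cast<float>(f);
    if (static_cast<double>(g) == f) {
      uint32_t bits = 0;
      std::memcpy(&bits, &g, sizeof bits);
      out.push_back(0xFA);
      for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<uint8_t>(bits >> shift));  // Big-endian.
      }
      return out;
    }
  }
  uint64_t bits = 0;
  std::memcpy(&bits, &f, sizeof bits);
  out.push_back(0xFB);
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.push_back(static_cast<uint8_t>(bits >> shift));  // Big-endian.
  }
  return out;
}
```

// Under those assumptions, 100000.0 fits in 5 bytes (0xFA 47 C3 50 00)
// whereas 1.1 needs all 9 bytes, matching the examples in RFC 8949.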
935
// write_number_cbor_u64 writes a CBOR head. The base argument holds the major
// type in its high 3 bits (e.g. 0x00 for unsigned integers, 0x20 for negative
// integers) and u is written in the shortest of the inline, 1, 2, 4 or 8 byte
// big-endian forms.
const char*  //
write_number_cbor_u64(uint8_t base, uint64_t u) {
  uint8_t buf[9];
  if (u < 0x18) {
    buf[0] = base | ((uint8_t)u);
    return write_dst(&buf[0], 1);
  } else if ((u >> 8) == 0) {
    buf[0] = base | 0x18;
    buf[1] = ((uint8_t)u);
    return write_dst(&buf[0], 2);
  } else if ((u >> 16) == 0) {
    buf[0] = base | 0x19;
    wuffs_base__store_u16be__no_bounds_check(&buf[1], ((uint16_t)u));
    return write_dst(&buf[0], 3);
  } else if ((u >> 32) == 0) {
    buf[0] = base | 0x1A;
    wuffs_base__store_u32be__no_bounds_check(&buf[1], ((uint32_t)u));
    return write_dst(&buf[0], 5);
  }
  buf[0] = base | 0x1B;
  wuffs_base__store_u64be__no_bounds_check(&buf[1], u);
  return write_dst(&buf[0], 9);
}
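
// The same head encoding can be sketched as a pure function that returns the
// bytes instead of writing them. The names below (sketch_cbor_head,
// sketch_cbor_int) are hypothetical and not part of jsonptr. The negative
// integer case uses the CBOR rule that major type 1 stores (-1 - v), which
// for two's complement values equals ~v.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns the CBOR head for the given base (major type
// in the high 3 bits) and argument u, using the shortest form.
std::vector<uint8_t> sketch_cbor_head(uint8_t base, uint64_t u) {
  std::vector<uint8_t> out;
  if (u < 0x18) {
    out.push_back(static_cast<uint8_t>(base | u));  // Inline argument.
    return out;
  }
  uint8_t info = 0x1B;
  int num_bytes = 8;
  if (u <= 0xFF) {
    info = 0x18;
    num_bytes = 1;
  } else if (u <= 0xFFFF) {
    info = 0x19;
    num_bytes = 2;
  } else if (u <= 0xFFFFFFFF) {
    info = 0x1A;
    num_bytes = 4;
  }
  out.push_back(static_cast<uint8_t>(base | info));
  for (int shift = 8 * (num_bytes - 1); shift >= 0; shift -= 8) {
    out.push_back(static_cast<uint8_t>(u >> shift));  // Big-endian.
  }
  return out;
}

// Hypothetical sketch: encodes a signed integer as CBOR major type 0 or 1.
std::vector<uint8_t> sketch_cbor_int(int64_t v) {
  if (v < 0) {
    return sketch_cbor_head(0x20, ~static_cast<uint64_t>(v));
  }
  return sketch_cbor_head(0x00, static_cast<uint64_t>(v));
}
```

// For example, 1000 encodes as 0x19 03 E8 and -1000 as 0x39 03 E7, matching
// the examples in RFC 8949.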
959
960const char* //
961write_number(uint64_t vbd, uint8_t* ptr, size_t len) {
Nigel Tao4e193592020-07-15 12:48:57 +1000962 if (g_flags.output_format == file_format::json) {
963 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_TEXT) {
Nigel Tao168f60a2020-07-14 13:19:33 +1000964 return write_dst(ptr, len);
Nigel Tao4e193592020-07-15 12:48:57 +1000965 } else if ((vbd &
966 WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_INTEGER_UNSIGNED) &&
967 (vbd &
968 WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_BINARY_BIG_ENDIAN)) {
969 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_IGNORE_FIRST_BYTE) {
970 if (len == 0) {
971 goto fail;
972 }
973 ptr++;
974 len--;
975 }
976 uint64_t u;
977 switch (len) {
978 case 1:
979 u = wuffs_base__load_u8__no_bounds_check(ptr);
980 break;
981 case 2:
982 u = wuffs_base__load_u16be__no_bounds_check(ptr);
983 break;
984 case 4:
985 u = wuffs_base__load_u32be__no_bounds_check(ptr);
986 break;
987 case 8:
988 u = wuffs_base__load_u64be__no_bounds_check(ptr);
989 break;
990 default:
991 goto fail;
992 }
993 uint8_t buf[WUFFS_BASE__U64__BYTE_LENGTH__MAX_INCL];
994 size_t n = wuffs_base__render_number_u64(
995 wuffs_base__make_slice_u8(&buf[0], sizeof buf), u,
996 WUFFS_BASE__RENDER_NUMBER_XXX__DEFAULT_OPTIONS);
997 return write_dst(&buf[0], n);
Nigel Tao168f60a2020-07-14 13:19:33 +1000998 }
999
Nigel Tao4e193592020-07-15 12:48:57 +10001000 // From here on, (g_flags.output_format == file_format::cbor).
1001 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_BINARY_BIG_ENDIAN) {
1002 return write_dst(ptr, len);
1003 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_TEXT) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001004 // First try to parse (ptr, len) as an integer. Something like
1005 // "1180591620717411303424" is a valid number (in the JSON sense) but will
1006 // overflow int64_t or uint64_t, so fall back to parsing it as a float64.
1007 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_INTEGER_SIGNED) {
1008 if ((len > 0) && (ptr[0] == '-')) {
1009 wuffs_base__result_i64 ri = wuffs_base__parse_number_i64(
1010 wuffs_base__make_slice_u8(ptr, len),
1011 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1012 if (ri.status.is_ok()) {
1013 return write_number_cbor_u64(0x20, ~ri.value);
1014 }
1015 } else {
1016 wuffs_base__result_u64 ru = wuffs_base__parse_number_u64(
1017 wuffs_base__make_slice_u8(ptr, len),
1018 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1019 if (ru.status.is_ok()) {
1020 return write_number_cbor_u64(0x00, ru.value);
1021 }
1022 }
1023 }
1024
1025 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_FLOATING_POINT) {
1026 wuffs_base__result_f64 rf = wuffs_base__parse_number_f64(
1027 wuffs_base__make_slice_u8(ptr, len),
1028 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1029 if (rf.status.is_ok()) {
1030 return write_number_cbor_f64(rf.value);
1031 }
1032 }
1033 }
1034
Nigel Tao4e193592020-07-15 12:48:57 +10001035fail:
Nigel Tao168f60a2020-07-14 13:19:33 +10001036 return "main: internal error: unexpected write_number argument";
1037}
1038
Nigel Tao4e193592020-07-15 12:48:57 +10001039const char* //
1040write_inline_integer(uint64_t vbd, uint8_t* ptr, size_t len) {
1041 if (g_flags.output_format == file_format::cbor) {
1042 return write_dst(ptr, len);
1043 }
1044
1045 uint8_t buf[WUFFS_BASE__I64__BYTE_LENGTH__MAX_INCL];
1046 size_t n = wuffs_base__render_number_i64(
1047 wuffs_base__make_slice_u8(&buf[0], sizeof buf), (int16_t)vbd,
1048 WUFFS_BASE__RENDER_NUMBER_XXX__DEFAULT_OPTIONS);
1049 return write_dst(&buf[0], n);
1050}
1051
Nigel Tao168f60a2020-07-14 13:19:33 +10001052// ----
1053
uint8_t  //
hex_digit(uint8_t nibble) {
  nibble &= 0x0F;
  if (nibble <= 9) {
    return '0' + nibble;
  }
  return ('A' - 10) + nibble;
}
1062
Nigel Tao2914bae2020-02-26 09:40:30 +11001063const char* //
Nigel Tao168f60a2020-07-14 13:19:33 +10001064flush_cbor_output_string() {
1065 uint8_t prefix[3];
1066 prefix[0] = g_cbor_output_string_is_utf_8 ? 0x60 : 0x40;
1067 if (g_cbor_output_string_length < 0x18) {
1068 prefix[0] |= g_cbor_output_string_length;
1069 TRY(write_dst(&prefix[0], 1));
1070 } else if (g_cbor_output_string_length <= 0xFF) {
1071 prefix[0] |= 0x18;
1072 prefix[1] = g_cbor_output_string_length;
1073 TRY(write_dst(&prefix[0], 2));
1074 } else if (g_cbor_output_string_length <= 0xFFFF) {
1075 prefix[0] |= 0x19;
1076 prefix[1] = g_cbor_output_string_length >> 8;
1077 prefix[2] = g_cbor_output_string_length;
1078 TRY(write_dst(&prefix[0], 3));
1079 } else {
1080 return "main: internal error: CBOR string output is too long";
1081 }
1082
1083 size_t n = g_cbor_output_string_length;
1084 g_cbor_output_string_length = 0;
1085 return write_dst(&g_cbor_output_string_array[0], n);
1086}
1087
1088const char* //
1089write_cbor_output_string(uint8_t* ptr, size_t len, bool finish) {
1090 // Check that g_cbor_output_string_array can hold any UTF-8 code point.
1091 if (CBOR_OUTPUT_STRING_ARRAY_SIZE < 4) {
1092 return "main: internal error: CBOR_OUTPUT_STRING_ARRAY_SIZE is too short";
1093 }
1094
1095 while (len > 0) {
1096 size_t available =
1097 CBOR_OUTPUT_STRING_ARRAY_SIZE - g_cbor_output_string_length;
1098 if (available >= len) {
1099 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1100 len);
1101 g_cbor_output_string_length += len;
1102 ptr += len;
1103 len = 0;
1104 break;
1105
1106 } else if (available > 0) {
1107 if (!g_cbor_output_string_is_multiple_chunks) {
1108 g_cbor_output_string_is_multiple_chunks = true;
1109 TRY(write_dst(g_cbor_output_string_is_utf_8 ? "\x7F" : "\x5F", 1));
Nigel Tao3b486982020-02-27 15:05:59 +11001110 }
Nigel Tao168f60a2020-07-14 13:19:33 +10001111
1112 if (g_cbor_output_string_is_utf_8) {
1113 // Walk the end backwards to a UTF-8 boundary, so that each chunk of
1114 // the multi-chunk string is also valid UTF-8.
1115 while (available > 0) {
1116 wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next_from_end(
1117 wuffs_base__make_slice_u8(ptr, available));
1118 if ((o.code_point != WUFFS_BASE__UNICODE_REPLACEMENT_CHARACTER) ||
1119 (o.byte_length != 1)) {
1120 break;
1121 }
1122 available--;
1123 }
1124 }
1125
1126 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1127 available);
1128 g_cbor_output_string_length += available;
1129 ptr += available;
1130 len -= available;
Nigel Tao3b486982020-02-27 15:05:59 +11001131 }
1132
Nigel Tao168f60a2020-07-14 13:19:33 +10001133 TRY(flush_cbor_output_string());
1134 }
Nigel Taob9ad34f2020-03-03 12:44:01 +11001135
Nigel Tao168f60a2020-07-14 13:19:33 +10001136 if (finish) {
1137 TRY(flush_cbor_output_string());
1138 if (g_cbor_output_string_is_multiple_chunks) {
1139 TRY(write_dst("\xFF", 1));
1140 }
1141 }
1142 return nullptr;
1143}
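
// The "walk the end backwards to a UTF-8 boundary" step can be sketched on
// its own. This hypothetical helper (sketch_utf8_chunk_length is not part of
// jsonptr) uses a different technique from the code above: instead of
// decoding the last code point from the end, it backs up while the byte just
// after the cut is a UTF-8 continuation byte (0b10xxxxxx). It assumes that
// the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: returns the largest n <= available such that the
// first n bytes of s do not end in the middle of a UTF-8 code point.
std::size_t sketch_utf8_chunk_length(const std::string& s,
                                     std::size_t available) {
  if (available >= s.size()) {
    return s.size();
  }
  std::size_t n = available;
  // s[n] is the first byte that would be left out of the chunk. If it is a
  // continuation byte then the cut splits a code point, so move it back.
  while ((n > 0) && ((static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)) {
    n--;
  }
  return n;
}
```

// For the 5 byte string "a", U+2026, "b" (0x61 E2 80 A6 62), asking for up
// to 2 or 3 bytes yields a 1 byte chunk, so that the 3 byte code point stays
// whole in the next chunk.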
Nigel Taob9ad34f2020-03-03 12:44:01 +11001144
Nigel Tao168f60a2020-07-14 13:19:33 +10001145const char* //
1146handle_string(uint64_t vbd,
1147 uint64_t len,
1148 bool start_of_token_chain,
1149 bool continued) {
1150 if (start_of_token_chain) {
1151 if (g_flags.output_format == file_format::json) {
1152 TRY(write_dst("\"", 1));
1153 } else {
1154 g_cbor_output_string_length = 0;
1155 g_cbor_output_string_is_multiple_chunks = false;
1156 g_cbor_output_string_is_utf_8 =
1157 vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8;
1158 }
1159 g_query.restart_fragment(in_dict_before_key() && g_query.is_at(g_depth));
1160 }
1161
1162 if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
1163 // No-op.
1164 } else if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
1165 uint8_t* ptr = g_src.data.ptr + g_curr_token_end_src_index - len;
1166 if (g_flags.output_format == file_format::json) {
1167 // TODO: if the input is CBOR but the output is JSON then we have to
1168 // escape '\n', '\"', etc.
1169 TRY(write_dst(ptr, len));
1170 } else {
1171 TRY(write_cbor_output_string(ptr, len, false));
1172 }
1173 g_query.incremental_match_slice(ptr, len);
Nigel Taob9ad34f2020-03-03 12:44:01 +11001174 } else {
Nigel Tao168f60a2020-07-14 13:19:33 +10001175 return "main: internal error: unexpected string-token conversion";
1176 }
1177
1178 if (continued) {
1179 return nullptr;
1180 }
1181
1182 if (g_flags.output_format == file_format::json) {
1183 TRY(write_dst("\"", 1));
1184 } else {
1185 TRY(write_cbor_output_string(nullptr, 0, true));
1186 }
1187 return nullptr;
1188}
1189
1190const char* //
1191handle_unicode_code_point(uint32_t ucp) {
1192 if (g_flags.output_format == file_format::json) {
1193 if (ucp < 0x0020) {
1194 switch (ucp) {
1195 case '\b':
1196 return write_dst("\\b", 2);
1197 case '\f':
1198 return write_dst("\\f", 2);
1199 case '\n':
1200 return write_dst("\\n", 2);
1201 case '\r':
1202 return write_dst("\\r", 2);
1203 case '\t':
1204 return write_dst("\\t", 2);
1205 }
1206
1207 // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
1208 // JSON string. They need to remain escaped.
1209 uint8_t esc6[6];
1210 esc6[0] = '\\';
1211 esc6[1] = 'u';
1212 esc6[2] = '0';
1213 esc6[3] = '0';
1214 esc6[4] = hex_digit(ucp >> 4);
1215 esc6[5] = hex_digit(ucp >> 0);
1216 return write_dst(&esc6[0], 6);
1217
1218 } else if (ucp == '\"') {
1219 return write_dst("\\\"", 2);
1220
1221 } else if (ucp == '\\') {
1222 return write_dst("\\\\", 2);
Nigel Tao3b486982020-02-27 15:05:59 +11001223 }
Nigel Tao3b486982020-02-27 15:05:59 +11001224 }
1225
Nigel Tao168f60a2020-07-14 13:19:33 +10001226 uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
1227 size_t n = wuffs_base__utf_8__encode(
1228 wuffs_base__make_slice_u8(&u[0],
1229 WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
1230 ucp);
1231 if (n == 0) {
1232 return "main: internal error: unexpected Unicode code point";
1233 }
1234
1235 if (g_flags.output_format == file_format::json) {
1236 return write_dst(&u[0], n);
1237 }
1238 return write_cbor_output_string(&u[0], n, false);
Nigel Tao3b486982020-02-27 15:05:59 +11001239}
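
// The JSON escaping rules used above can be sketched as a function over a
// whole byte string. This is a hypothetical helper (sketch_json_escape is
// not part of jsonptr). It assumes that the input is valid UTF-8 and passes
// bytes at or above 0x20 through unchanged, other than '"' and '\\'.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: escapes s for use inside a JSON string (without the
// surrounding quotes). Control bytes without a short form become \u00XX.
std::string sketch_json_escape(const std::string& s) {
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : s) {
    switch (c) {
      case '\b':
        out += "\\b";
        break;
      case '\f':
        out += "\\f";
        break;
      case '\n':
        out += "\\n";
        break;
      case '\r':
        out += "\\r";
        break;
      case '\t':
        out += "\\t";
        break;
      case '"':
        out += "\\\"";
        break;
      case '\\':
        out += "\\\\";
        break;
      default:
        if (c < 0x20) {
          out += "\\u00";
          out += hex[c >> 4];
          out += hex[c & 0x0F];
        } else {
          out += static_cast<char>(c);
        }
        break;
    }
  }
  return out;
}
```

// A pass like this is one possible way to address the TODO in handle_string
// about escaping CBOR string contents when the output format is JSON.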
1240
1241const char* //
Nigel Tao2ef39992020-04-09 17:24:39 +10001242handle_token(wuffs_base__token t, bool start_of_token_chain) {
Nigel Tao2cf76db2020-02-27 22:42:01 +11001243 do {
Nigel Tao462f8662020-04-01 23:01:51 +11001244 int64_t vbc = t.value_base_category();
Nigel Tao2cf76db2020-02-27 22:42:01 +11001245 uint64_t vbd = t.value_base_detail();
1246 uint64_t len = t.length();
Nigel Tao1b073492020-02-16 22:11:36 +11001247
1248 // Handle ']' or '}'.
Nigel Tao9f7a2502020-02-23 09:42:02 +11001249 if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
Nigel Tao2cf76db2020-02-27 22:42:01 +11001250 (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
Nigel Taod60815c2020-03-26 14:32:35 +11001251 if (g_query.is_at(g_depth)) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001252 return "main: no match for query";
1253 }
Nigel Taod60815c2020-03-26 14:32:35 +11001254 if (g_depth <= 0) {
1255 return "main: internal error: inconsistent g_depth";
Nigel Tao1b073492020-02-16 22:11:36 +11001256 }
Nigel Taod60815c2020-03-26 14:32:35 +11001257 g_depth--;
Nigel Tao1b073492020-02-16 22:11:36 +11001258
Nigel Taod60815c2020-03-26 14:32:35 +11001259 if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
1260 g_suppress_write_dst--;
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001261 // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
Nigel Tao168f60a2020-07-14 13:19:33 +10001262 if (g_flags.output_format == file_format::json) {
1263 TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
1264 ? "\"[…]\""
1265 : "\"{…}\"",
1266 7));
1267 } else {
1268 TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
1269 ? "\x65[…]"
1270 : "\x65{…}",
1271 6));
1272 }
1273 } else if (g_flags.output_format == file_format::json) {
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001274 // Write preceding whitespace.
Nigel Taod60815c2020-03-26 14:32:35 +11001275 if ((g_ctx != context::in_list_after_bracket) &&
1276 (g_ctx != context::in_dict_after_brace) &&
1277 !g_flags.compact_output) {
Nigel Taoc766bb72020-07-09 12:59:32 +10001278 if (g_flags.output_json_extra_comma) {
1279 TRY(write_dst(",\n", 2));
1280 } else {
1281 TRY(write_dst("\n", 1));
1282 }
Nigel Taod60815c2020-03-26 14:32:35 +11001283 for (uint32_t i = 0; i < g_depth; i++) {
1284 TRY(write_dst(
1285 g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
Nigel Taoecadf722020-07-13 08:22:34 +10001286 g_flags.tabs ? 1 : g_flags.spaces));
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001287 }
Nigel Tao1b073492020-02-16 22:11:36 +11001288 }
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001289
1290 TRY(write_dst(
1291 (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
1292 1));
Nigel Tao168f60a2020-07-14 13:19:33 +10001293 } else {
1294 TRY(write_dst("\xFF", 1));
Nigel Tao1b073492020-02-16 22:11:36 +11001295 }
1296
Nigel Taod60815c2020-03-26 14:32:35 +11001297 g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
1298 ? context::in_list_after_value
1299 : context::in_dict_after_key;
Nigel Tao1b073492020-02-16 22:11:36 +11001300 goto after_value;
1301 }
1302
Nigel Taod1c928a2020-02-28 12:43:53 +11001303 // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
1304 // continuation of a multi-token chain.
Nigel Tao2ef39992020-04-09 17:24:39 +10001305 if (start_of_token_chain) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001306 if (g_flags.output_format != file_format::json) {
1307 // No-op.
1308 } else if (g_ctx == context::in_dict_after_key) {
Nigel Taod60815c2020-03-26 14:32:35 +11001309 TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
1310 } else if (g_ctx != context::none) {
1311 if ((g_ctx != context::in_list_after_bracket) &&
1312 (g_ctx != context::in_dict_after_brace)) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001313 TRY(write_dst(",", 1));
Nigel Tao107f0ef2020-03-01 21:35:02 +11001314 }
Nigel Taod60815c2020-03-26 14:32:35 +11001315 if (!g_flags.compact_output) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001316 TRY(write_dst("\n", 1));
Nigel Taod60815c2020-03-26 14:32:35 +11001317 for (size_t i = 0; i < g_depth; i++) {
1318 TRY(write_dst(
1319 g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
Nigel Taoecadf722020-07-13 08:22:34 +10001320 g_flags.tabs ? 1 : g_flags.spaces));
Nigel Tao0cd2f982020-03-03 23:03:02 +11001321 }
1322 }
1323 }
1324
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001325 bool query_matched_fragment = false;
Nigel Taod60815c2020-03-26 14:32:35 +11001326 if (g_query.is_at(g_depth)) {
1327 switch (g_ctx) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001328 case context::in_list_after_bracket:
1329 case context::in_list_after_value:
Nigel Taod60815c2020-03-26 14:32:35 +11001330 query_matched_fragment = g_query.tick();
Nigel Tao0cd2f982020-03-03 23:03:02 +11001331 break;
1332 case context::in_dict_after_key:
Nigel Taod60815c2020-03-26 14:32:35 +11001333 query_matched_fragment = g_query.matched_fragment();
Nigel Tao0cd2f982020-03-03 23:03:02 +11001334 break;
Nigel Tao18ef5b42020-03-16 10:37:47 +11001335 default:
1336 break;
Nigel Tao0cd2f982020-03-03 23:03:02 +11001337 }
1338 }
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001339 if (!query_matched_fragment) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001340 // No-op.
Nigel Taod60815c2020-03-26 14:32:35 +11001341 } else if (!g_query.next_fragment()) {
Nigel Tao0cd2f982020-03-03 23:03:02 +11001342 // There is no next fragment. We have matched the complete query, and
1343 // the upcoming JSON value is the result of that query.
1344 //
Nigel Taod60815c2020-03-26 14:32:35 +11001345 // Un-suppress writing to stdout and reset the g_ctx and g_depth as if
1346 // we were about to decode a top-level value. This makes any subsequent
1347 // indentation be relative to this point, and we will return g_eod
1348 // after the upcoming JSON value is complete.
1349 if (g_suppress_write_dst != 1) {
1350 return "main: internal error: inconsistent g_suppress_write_dst";
Nigel Tao52c4d6a2020-03-08 21:12:38 +11001351 }
Nigel Taod60815c2020-03-26 14:32:35 +11001352 g_suppress_write_dst = 0;
1353 g_ctx = context::none;
1354 g_depth = 0;
Nigel Tao0cd2f982020-03-03 23:03:02 +11001355 } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
1356 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
1357 // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else if (g_flags.output_format == file_format::json) {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        } else {
          TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                            ? "\x9F"
                            : "\xBF",
                        1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        TRY(handle_string(vbd, len, start_of_token_chain, t.continued()));
        if (t.continued()) {
          return nullptr;
        }
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.continued()) {
          return "main: internal error: unexpected non-continued UCP token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
        TRY(write_literal(vbd));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_number(vbd, g_src.data.ptr + g_curr_token_end_src_index - len,
                         len));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER:
        TRY(write_inline_integer(
            vbd, g_src.data.ptr + g_curr_token_end_src_index - len, len));
        goto after_value;
    }

    // Return an error if we didn't match the (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  bool start_of_token_chain = true;
  while (true) {
    wuffs_base__status status = g_dec->decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t n = t.length();
      if ((g_src.meta.ri - g_curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value() == 0) {
        start_of_token_chain = !t.continued();
        continue;
      }

      const char* z = handle_token(t, start_of_token_chain);
      start_of_token_chain = !t.continued();
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_curr_token_end_src_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_curr_token_end_src_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless they come after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = (g_flags.output_format == file_format::json)
                         ? write_dst("\n", 1)
                         : nullptr;
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}