// Copyright 2020 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// ----------------

/*
jsonptr is a JSON formatter (pretty-printer) that supports the JSON Pointer
(RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin and writes CBOR
or canonicalized, formatted UTF-8 JSON to stdout.

See the "const char* g_usage" string below for details.

----

JSON Pointer (and this program's implementation) is one of many JSON query
languages and JSON tools, such as jq, jql and JMESPath. This one is relatively
simple and has fewer features than those others.

One benefit of simplicity is that this program's CBOR, JSON and JSON Pointer
implementations do not dynamically allocate or free memory (yet they do not
require that the entire input fits in memory at once). They are therefore
trivially protected against certain bug classes: memory leaks, double-frees and
use-after-frees.
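
As an illustration of processing arbitrarily large input without dynamic
allocation, here is a minimal sketch (not this program's actual code). It
assumes a hypothetical helper, max_bracket_depth, that streams its input
through a fixed-size static buffer (g_sketch_buf, also hypothetical) and
tracks only a couple of integers. It is a toy: it does not understand JSON
strings, so brackets inside string literals are counted too.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>

// A fixed-size, statically allocated buffer. The input may be far larger.
static char g_sketch_buf[4096];

// max_bracket_depth (hypothetical) returns the deepest nesting of '[' and ']'
// bytes seen in the stream. The function itself never allocates memory.
size_t max_bracket_depth(std::istream& in) {
  size_t depth = 0;
  size_t max_depth = 0;
  while (in) {
    in.read(g_sketch_buf, sizeof(g_sketch_buf));
    std::streamsize n = in.gcount();  // May be short (or zero) at end of data.
    for (std::streamsize i = 0; i < n; i++) {
      if (g_sketch_buf[i] == '[') {
        depth++;
        if (max_depth < depth) {
          max_depth = depth;
        }
      } else if ((g_sketch_buf[i] == ']') && (depth > 0)) {
        depth--;
      }
    }
  }
  return max_depth;
}
```

Memory use is bounded by the buffer size, independent of the input length.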

The CBOR and JSON implementations are also written in the Wuffs programming
language (and then transpiled to C/C++), which is not only memory-safe (e.g.
array indexing is bounds-checked) but also prevents integer arithmetic
overflows.

For defense in depth, on Linux, this program also self-imposes a
SECCOMP_MODE_STRICT sandbox before reading (or otherwise processing) its input
or writing its output. Under this sandbox, the only permitted system calls are
read, write, exit and sigreturn.
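
A minimal sketch of entering such a sandbox follows. It is an illustration
under stated assumptions, not necessarily how this program does it: the
run_in_strict_sandbox helper is hypothetical, and it forks a child so that the
calling process stays unsandboxed. Note that the child must finish with the
raw exit system call, since the C library's exit functions typically use
exit_group, which strict mode does not permit.

```cpp
#include <cstring>

#if defined(__linux__)
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#endif

// run_in_strict_sandbox (hypothetical) forks a child that enters
// SECCOMP_MODE_STRICT, writes msg to stdout and exits. It returns the child's
// exit code: 0 on success, 1 if the write was short, 2 if the sandbox could
// not be entered. It returns -1 if there is no child exit code to report
// (e.g. a non-Linux build, or the kernel killed the child).
int run_in_strict_sandbox(const char* msg) {
#if defined(__linux__)
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
      _exit(2);
    }
    // From here on, only read, write, exit and sigreturn are permitted.
    size_t n = strlen(msg);
    ssize_t w = write(1, msg, n);
    syscall(SYS_exit, (w == (ssize_t)n) ? 0 : 1);
    // Not reached.
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
#else
  (void)msg;
  return -1;
#endif
}
```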

All together, this program aims to safely handle untrusted CBOR or JSON files
without fear of security bugs such as remote code execution.

----

As of 2020-02-24, this program passes all 318 "test_parsing" cases from the
JSON test suite (https://github.com/nst/JSONTestSuite), an appendix to the
"Parsing JSON is a Minefield" article (http://seriot.ch/parsing_json.php) that
was first published on 2016-10-26 and updated on 2018-03-30.

After modifying this program, run "build-example.sh example/jsonptr/" and then
"script/run-json-test-suite.sh" to catch correctness regressions.

----

This program uses Wuffs' JSON decoder at a relatively low level, processing
each token of the decoder's token-stream output individually. The core loop, in
pseudo-code, is "for_each_token { handle_token(etc); }", where the handle_token
function changes global state (e.g. the `g_depth` and `g_ctx` variables) and
prints output text based on that state and the token's source text. Notably,
handle_token is not recursive, even though JSON values can nest.
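
The non-recursive, state-driven idea can be sketched with a toy token stream.
This is an illustration only, not this program's handle_token: the
format_tokens function and its one-character tokens ('[' opens a list, ']'
closes it, anything else is a scalar) are hypothetical, and unlike jsonptr it
uses std::string, which allocates dynamically.

```cpp
#include <string>

enum class sketch_ctx { none, after_bracket, after_value };

// format_tokens (hypothetical) pretty-prints nested lists using only a depth
// counter and a context variable. It never calls itself.
std::string format_tokens(const std::string& tokens) {
  std::string out;
  unsigned depth = 0;
  sketch_ctx ctx = sketch_ctx::none;
  for (char t : tokens) {  // The "for_each_token" loop.
    if (t == ']') {
      if (depth == 0) {
        break;  // Unbalanced input: stop.
      }
      depth--;
      if (ctx == sketch_ctx::after_value) {
        out += '\n';
        out.append(2 * depth, ' ');
      }
      out += ']';
      ctx = sketch_ctx::after_value;  // A closed list is a value of its parent.
      continue;
    }
    if (ctx == sketch_ctx::after_value) {
      out += ',';
    }
    if (ctx != sketch_ctx::none) {
      out += '\n';
      out.append(2 * depth, ' ');
    }
    out += t;
    if (t == '[') {
      depth++;
      ctx = sketch_ctx::after_bracket;
    } else {
      ctx = sketch_ctx::after_value;
    }
  }
  return out;
}
```

Because closing a container leaves the context as "after a value", a single
context variable plus a depth counter suffices for this toy grammar.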

This approach is centered around JSON tokens. Each JSON 'thing' (e.g. number,
string, object) comprises one or more JSON tokens.

An alternative, higher-level approach is in the sibling example/jsonfindptrs
program. Neither approach is better or worse per se, but when studying this
program, be aware that there are multiple ways to use Wuffs' JSON decoder.

The two programs, jsonfindptrs and jsonptr, also demonstrate different
trade-offs with regard to JSON object duplicate keys. The JSON spec permits
different implementations to allow or reject duplicate keys. It is not always
clear which approach is safer. Rejecting them is certainly unambiguous, and
security bugs can lurk in ambiguous corners of a file format, if two different
implementations both silently accept a file but differ on how to interpret it.
On the other hand, in the worst case, detecting duplicate keys requires O(N)
memory, where N is the size of the (potentially untrusted) input.
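
To see where the O(N) comes from, consider a minimal sketch of duplicate
detection (the has_duplicate_key helper is hypothetical and is not part of
either program). In the worst case, when every key is distinct, the set of
previously seen keys grows to hold all of them.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// has_duplicate_key (hypothetical) reports whether any key occurs more than
// once. The "seen" set can grow as large as the input itself.
bool has_duplicate_key(const std::vector<std::string>& keys) {
  std::unordered_set<std::string> seen;
  for (const std::string& k : keys) {
    if (!seen.insert(k).second) {
      return true;  // The insert failed because k was already present.
    }
  }
  return false;
}
```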

This program (jsonptr) allows duplicate keys and requires only O(1) memory. As
mentioned above, it doesn't dynamically allocate memory at all, and on Linux,
it runs in a SECCOMP_MODE_STRICT sandbox.

----

This example program differs from most other example Wuffs programs in that it
is written in C++, not C.

    $CXX jsonptr.cc && ./a.out < ../../test/data/github-tags.json; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.
*/

#if defined(__cplusplus) && (__cplusplus < 201103L)
#error "This C++ program requires -std=c++11 or later"
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__MODULE* macros is optional, but doing so lets
// users of release/c/etc.c whitelist which parts of Wuffs to build. That file
// contains the entire Wuffs standard library, implementing a variety of codecs
// and file formats. Without this macro definition, an optimizing compiler or
// linker may very well discard Wuffs code for unused codecs, but listing the
// Wuffs modules we use makes that process explicit. Preprocessing means that
// such code simply isn't compiled.
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__CBOR
#define WUFFS_CONFIG__MODULE__JSON

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C++ file.
#include "../../release/c/wuffs-unsupported-snapshot.c"

#if defined(__linux__)
#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#define WUFFS_EXAMPLE_USE_SECCOMP
#endif

#define TRY(error_msg)         \
  do {                         \
    const char* z = error_msg; \
    if (z) {                   \
      return z;                \
    }                          \
  } while (false)

static const char* g_eod = "main: end of data";

static const char* g_usage =
    "Usage: jsonptr -flags input.json\n"
    "\n"
    "Flags:\n"
    "    -c      -compact-output\n"
    "    -d=NUM  -max-output-depth=NUM\n"
    "    -i=FMT  -input-format={json,cbor}\n"
    "    -o=FMT  -output-format={json,cbor}\n"
    "    -q=STR  -query=STR\n"
    "    -s=NUM  -spaces=NUM\n"
    "    -t      -tabs\n"
    "            -fail-if-unsandboxed\n"
    "            -input-allow-json-comments\n"
    "            -input-allow-json-extra-comma\n"
    "            -input-allow-json-inf-nan-numbers\n"
    "            -output-cbor-metadata-as-json-comments\n"
    "            -output-json-extra-comma\n"
    "            -strict-json-pointer-syntax\n"
    "\n"
    "The input.json filename is optional. If absent, it reads from stdin.\n"
    "\n"
    "----\n"
    "\n"
    "jsonptr is a JSON formatter (pretty-printer) that supports the JSON\n"
    "Pointer (RFC 6901) query syntax. It reads CBOR or UTF-8 JSON from stdin\n"
    "and writes CBOR or canonicalized, formatted UTF-8 JSON to stdout. The\n"
    "input and output formats do not have to match, but conversion between\n"
    "formats may be lossy.\n"
    "\n"
    "Canonicalized JSON means that e.g. \"abc\\u000A\\tx\\u0177z\" is re-\n"
    "written as \"abc\\n\\txŷz\". It does not sort object keys or reject\n"
    "duplicate keys. Canonicalization does not imply Unicode normalization.\n"
    "\n"
    "CBOR output is non-canonical (in the RFC 7049 Section 3.9 sense), as\n"
    "sorting map keys and measuring indefinite-length containers requires\n"
    "O(input_length) memory but this program runs in O(1) memory.\n"
    "\n"
    "Formatted means that arrays' and objects' elements are indented, each\n"
    "on its own line. Configure this with the -c / -compact-output, -s=NUM /\n"
    "-spaces=NUM (for NUM ranging from 0 to 8) and -t / -tabs flags. Those\n"
    "flags only apply to JSON (not CBOR) output.\n"
    "\n"
    "The -input-format and -output-format flags select between reading and\n"
    "writing JSON (the default, a textual format) or CBOR (a binary format).\n"
    "\n"
    "The -input-allow-json-comments flag allows \"/*slash-star*/\" and\n"
    "\"//slash-slash\" C-style comments within JSON input.\n"
    "\n"
    "The -input-allow-json-extra-comma flag allows input like \"[1,2,]\",\n"
    "with a comma after the final element of a JSON list or dictionary.\n"
    "\n"
    "The -input-allow-json-inf-nan-numbers flag allows non-finite floating\n"
    "point numbers (infinities and not-a-numbers) within JSON input.\n"
    "\n"
    "The -output-cbor-metadata-as-json-comments flag writes CBOR tags and\n"
    "other metadata as /*comments*/, when -i=cbor and -o=json are both set.\n"
    "Such comments are non-compliant with the JSON specification but many\n"
    "parsers accept them.\n"
    "\n"
    "The -output-json-extra-comma flag writes extra commas, regardless of\n"
    "whether the input had them. Extra commas are non-compliant with the\n"
    "JSON specification but many parsers accept them and they can produce\n"
    "simpler line-based diffs. This flag is ignored when -compact-output is\n"
    "set.\n"
    "\n"
    "When converting from -i=cbor to -o=json, CBOR permits map keys other\n"
    "than (untagged) UTF-8 strings but JSON does not. This program rejects\n"
    "such input, as doing otherwise has complicated interactions with the\n"
    "-query=STR flag and streaming input.\n"
    "\n"
    "----\n"
    "\n"
    "The -q=STR or -query=STR flag gives an optional JSON Pointer query, to\n"
    "print a subset of the input. For example, given RFC 6901 section 5's\n"
    "sample input (https://tools.ietf.org/rfc/rfc6901.txt), this command:\n"
    "    jsonptr -query=/foo/1 rfc-6901-json-pointer.json\n"
    "will print:\n"
    "    \"baz\"\n"
    "\n"
    "An absent query is equivalent to the empty query, which identifies the\n"
    "entire input (the root value). Unlike a file system, the \"/\" query\n"
    "does not identify the root. Instead, \"\" is the root and \"/\" is the\n"
    "child (the value in a key-value pair) of the root whose key is the empty\n"
    "string. Similarly, \"/xyz\" and \"/xyz/\" are two different nodes.\n"
    "\n"
    "If the query found a valid JSON|CBOR value, this program will return a\n"
    "zero exit code even if the rest of the input isn't valid. If the query\n"
    "did not find a value, or found an invalid one, this program returns a\n"
    "non-zero exit code, but may still print partial output to stdout.\n"
    "\n"
    "The JSON and CBOR specifications (https://json.org/ or RFC 8259; RFC\n"
    "7049) permit implementations to allow duplicate keys, as this one does.\n"
    "This JSON Pointer implementation is also greedy, following the first\n"
    "match for each fragment without back-tracking. For example, the\n"
    "\"/foo/bar\" query will fail if the root object has multiple \"foo\"\n"
    "children but the first one doesn't have a \"bar\" child, even if later\n"
    "ones do.\n"
    "\n"
    "The -strict-json-pointer-syntax flag restricts the -query=STR string to\n"
    "exactly RFC 6901, with only two escape sequences: \"~0\" and \"~1\" for\n"
    "\"~\" and \"/\". Without this flag, this program also lets \"~n\" and\n"
    "\"~r\" escape the New Line and Carriage Return ASCII control characters,\n"
    "which can work better with line oriented Unix tools that assume exactly\n"
    "one value (i.e. one JSON Pointer string) per line.\n"
    "\n"
    "----\n"
    "\n"
    "The -d=NUM or -max-output-depth=NUM flag gives the maximum (inclusive)\n"
    "output depth. JSON|CBOR containers ([] arrays and {} objects) can hold\n"
    "other containers. When this flag is set, containers at depth NUM are\n"
    "replaced with \"[…]\" or \"{…}\". A bare -d or -max-output-depth is\n"
    "equivalent to -d=1. The flag's absence means an unlimited output depth.\n"
    "\n"
    "The -max-output-depth flag only affects the program's output. It doesn't\n"
    "affect whether or not the input is considered valid JSON|CBOR. The\n"
    "format specifications permit implementations to set their own maximum\n"
    "input depth. This JSON|CBOR implementation sets it to 1024.\n"
    "\n"
    "Depth is measured in terms of nested containers. It is unaffected by the\n"
    "number of spaces or tabs used to indent.\n"
    "\n"
    "When both -max-output-depth and -query are set, the output depth is\n"
    "measured from when the query resolves, not from the input root. The\n"
    "input depth (measured from the root) is still limited to 1024.\n"
    "\n"
    "----\n"
    "\n"
    "The -fail-if-unsandboxed flag causes the program to exit if it does not\n"
    "self-impose a sandbox. On Linux, it self-imposes a SECCOMP_MODE_STRICT\n"
    "sandbox, regardless of whether this flag was set.";

// ----

// Wuffs allows either statically or dynamically allocated work buffers. This
// program exercises static allocation.
#define WORK_BUFFER_ARRAY_SIZE \
  WUFFS_JSON__DECODER_WORKBUF_LEN_MAX_INCL_WORST_CASE
#if WORK_BUFFER_ARRAY_SIZE > 0
uint8_t g_work_buffer_array[WORK_BUFFER_ARRAY_SIZE];
#else
// Not all C/C++ compilers support 0-length arrays.
uint8_t g_work_buffer_array[1];
#endif

bool g_sandboxed = false;

int g_input_file_descriptor = 0;  // A 0 default means stdin.

#define MAX_INDENT 8
#define INDENT_SPACES_STRING "        "
#define INDENT_TAB_STRING "\t"

#ifndef DST_BUFFER_ARRAY_SIZE
#define DST_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef SRC_BUFFER_ARRAY_SIZE
#define SRC_BUFFER_ARRAY_SIZE (32 * 1024)
#endif
#ifndef TOKEN_BUFFER_ARRAY_SIZE
#define TOKEN_BUFFER_ARRAY_SIZE (4 * 1024)
#endif

uint8_t g_dst_array[DST_BUFFER_ARRAY_SIZE];
uint8_t g_src_array[SRC_BUFFER_ARRAY_SIZE];
wuffs_base__token g_tok_array[TOKEN_BUFFER_ARRAY_SIZE];

wuffs_base__io_buffer g_dst;
wuffs_base__io_buffer g_src;
wuffs_base__token_buffer g_tok;

// g_curr_token_end_src_index is the g_src.data.ptr index of the end of the
// current token. An invariant is that (g_curr_token_end_src_index <=
// g_src.meta.ri).
size_t g_curr_token_end_src_index;

struct {
  uint64_t category;
  uint64_t detail;
} g_token_extension;

uint32_t g_depth;

enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
} g_ctx;

bool  //
in_dict_before_key() {
  return (g_ctx == context::in_dict_after_brace) ||
         (g_ctx == context::in_dict_after_value);
}

uint32_t g_suppress_write_dst;
bool g_wrote_to_dst;

wuffs_cbor__decoder g_cbor_decoder;
wuffs_json__decoder g_json_decoder;
wuffs_base__token_decoder* g_dec;

// g_cbor_output_string_array is a 4 KiB buffer. For -output-format=cbor,
// strings whose length is 4096 or less are written as a single definite-length
// string. Longer strings are written as an indefinite-length string containing
// multiple definite-length chunks, each of length up to 4 KiB. See the CBOR
// RFC (RFC 7049) section 2.2.2 "Indefinite-Length Byte Strings and Text
// Strings". The output is determinate even when the input is streamed.
//
// If raising CBOR_OUTPUT_STRING_ARRAY_SIZE above 0xFFFF then you will also
// have to update flush_cbor_output_string.
#define CBOR_OUTPUT_STRING_ARRAY_SIZE 4096
uint8_t g_cbor_output_string_array[CBOR_OUTPUT_STRING_ARRAY_SIZE];

uint32_t g_cbor_output_string_length;
bool g_cbor_output_string_is_multiple_chunks;
bool g_cbor_output_string_is_utf_8;

// ----

// Query is a JSON Pointer query. After initializing with a NUL-terminated C
// string, its multiple fragments are consumed as the program walks the JSON
// data from stdin. For example, letting "$" denote a NUL, suppose that we
// started with a query string of "/apple/banana/12/durian" and are currently
// trying to match the second fragment, "banana", so that Query::m_depth is 2:
//
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 m_frag_i    m_frag_k
//
// The two pointers m_frag_i and m_frag_k (abbreviated as mfi and mfk) are the
// start (inclusive) and end (exclusive) of the query fragment. They satisfy
// (mfi <= mfk) and may be equal if the fragment is empty (note that "" is a
// valid JSON object key).
//
// The m_frag_j (mfj) pointer moves between these two, or is nullptr. An
// invariant is that (((mfi <= mfj) && (mfj <= mfk)) || (mfj == nullptr)).
//
// Wuffs' JSON tokenizer can portray a single JSON string as multiple Wuffs
// tokens, as backslash-escaped values within that JSON string may each get
// their own token.
//
// At the start of each object key (a JSON string), mfj is set to mfi.
//
// While mfj remains non-nullptr, each token's unescaped contents are then
// compared to that part of the fragment from mfj to mfk. If it is a prefix
// (including the case of an exact match), then mfj is advanced by the
// unescaped length. Otherwise, mfj is set to nullptr.
//
// Comparison accounts for JSON Pointer's escaping notation: "~0" and "~1" in
// the query (not the JSON value) are unescaped to "~" and "/" respectively.
// "~n" and "~r" are also unescaped to "\n" and "\r". The program is
// responsible for calling Query::validate (with a strict_json_pointer_syntax
// argument) before otherwise using this class.
//
// The mfj pointer therefore advances from mfi to mfk, or drops out, as we
// incrementally match the object key with the query fragment. For example, if
// we have already matched the "ban" of "banana", then we would accept any of
// an "ana" token, an "a" token or a "\u0061" token, amongst others. They would
// advance mfj by 3, 1 or 1 bytes respectively.
//
//                       mfj
//                       v
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//   / a p p l e / b a n a n a / 1 2 / d u r i a n $
//  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
//                 ^           ^
//                 mfi         mfk
//
// At the end of each object key (or equivalently, at the start of each object
// value), if mfj is non-nullptr and equal to (but not less than) mfk then we
// have a fragment match: the query fragment equals the object key. If there is
// a next fragment (in this example, "12") we move the frag_etc pointers to its
// start and end and increment Query::m_depth. Otherwise, we have matched the
// complete query, and the upcoming JSON value is the result of that query.
//
// The discussion above centers on object keys. If the query fragment is
// numeric then it can also match as an array index: the string fragment "12"
// will match an array's 13th element (starting counting from zero). See RFC
// 6901 for its precise definition of an "array index" number.
//
// Array index fragment match is represented by the Query::m_array_index field,
// whose type (wuffs_base__result_u64) is a result type. An error result means
// that the fragment is not an array index. A value result holds the number of
// list elements remaining. When matching a query fragment in an array (instead
// of in an object), each element ticks this number down towards zero. At zero,
// the upcoming JSON value is the one that matches the query fragment.
class Query {
 private:
  uint8_t* m_frag_i;
  uint8_t* m_frag_j;
  uint8_t* m_frag_k;

  uint32_t m_depth;

  wuffs_base__result_u64 m_array_index;

 public:
  void reset(char* query_c_string) {
    m_frag_i = (uint8_t*)query_c_string;
    m_frag_j = (uint8_t*)query_c_string;
    m_frag_k = (uint8_t*)query_c_string;
    m_depth = 0;
    m_array_index.status.repr = "#main: not an array index query fragment";
    m_array_index.value = 0;
  }

  void restart_fragment(bool enable) { m_frag_j = enable ? m_frag_i : nullptr; }

  bool is_at(uint32_t depth) { return m_depth == depth; }

  // tick returns whether the fragment is a valid array index whose value is
  // zero. If valid but non-zero, it decrements it and returns false.
  bool tick() {
    if (m_array_index.status.is_ok()) {
      if (m_array_index.value == 0) {
        return true;
      }
      m_array_index.value--;
    }
    return false;
  }

  // next_fragment moves to the next fragment, returning whether it existed.
  bool next_fragment() {
    uint8_t* k = m_frag_k;
    uint32_t d = m_depth;

    this->reset(nullptr);

    if (!k || (*k != '/')) {
      return false;
    }
    k++;

    bool all_digits = true;
    uint8_t* i = k;
    while ((*k != '\x00') && (*k != '/')) {
      all_digits = all_digits && ('0' <= *k) && (*k <= '9');
      k++;
    }
    m_frag_i = i;
    m_frag_j = i;
    m_frag_k = k;
    m_depth = d + 1;
    if (all_digits) {
      // wuffs_base__parse_number_u64 rejects leading zeroes, e.g. "00", "07".
      m_array_index = wuffs_base__parse_number_u64(
          wuffs_base__make_slice_u8(i, k - i),
          WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
    }
    return true;
  }

  bool matched_all() { return m_frag_k == nullptr; }

  bool matched_fragment() { return m_frag_j && (m_frag_j == m_frag_k); }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100510
511 void incremental_match_slice(uint8_t* ptr, size_t len) {
Nigel Taob48ee752020-03-13 09:27:33 +1100512 if (!m_frag_j) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100513 return;
514 }
Nigel Taob48ee752020-03-13 09:27:33 +1100515 uint8_t* j = m_frag_j;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100516 while (true) {
517 if (len == 0) {
Nigel Taob48ee752020-03-13 09:27:33 +1100518 m_frag_j = j;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100519 return;
520 }
521
522 if (*j == '\x00') {
523 break;
524
525 } else if (*j == '~') {
526 j++;
527 if (*j == '0') {
528 if (*ptr != '~') {
529 break;
530 }
531 } else if (*j == '1') {
532 if (*ptr != '/') {
533 break;
534 }
Nigel Taod6fdfb12020-03-11 12:24:14 +1100535 } else if (*j == 'n') {
536 if (*ptr != '\n') {
537 break;
538 }
539 } else if (*j == 'r') {
540 if (*ptr != '\r') {
541 break;
542 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100543 } else {
544 break;
545 }
546
547 } else if (*j != *ptr) {
548 break;
549 }
550
551 j++;
552 ptr++;
553 len--;
554 }
Nigel Taob48ee752020-03-13 09:27:33 +1100555 m_frag_j = nullptr;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100556 }
557
558 void incremental_match_code_point(uint32_t code_point) {
Nigel Taob48ee752020-03-13 09:27:33 +1100559 if (!m_frag_j) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100560 return;
561 }
562 uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
563 size_t n = wuffs_base__utf_8__encode(
564 wuffs_base__make_slice_u8(&u[0],
565 WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
566 code_point);
567 if (n > 0) {
568 this->incremental_match_slice(&u[0], n);
569 }
570 }
571
572 // validate returns whether the (ptr, len) arguments form a valid JSON
573 // Pointer. In particular, it must be valid UTF-8, and either be empty or
574 // start with a '/'. Any '~' within must immediately be followed by either
Nigel Taod6fdfb12020-03-11 12:24:14 +1100575 // '0' or '1'. If strict_json_pointer_syntax is false, a '~' may also be
576 // followed by either 'n' or 'r'.
577 static bool validate(char* query_c_string,
578 size_t length,
579 bool strict_json_pointer_syntax) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100580 if (length <= 0) {
581 return true;
582 }
583 if (query_c_string[0] != '/') {
584 return false;
585 }
586 wuffs_base__slice_u8 s =
587 wuffs_base__make_slice_u8((uint8_t*)query_c_string, length);
588 bool previous_was_tilde = false;
589 while (s.len > 0) {
Nigel Tao702c7b22020-07-22 15:42:54 +1000590 wuffs_base__utf_8__next__output o = wuffs_base__utf_8__next(s.ptr, s.len);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100591 if (!o.is_valid()) {
592 return false;
593 }
Nigel Taod6fdfb12020-03-11 12:24:14 +1100594
595 if (previous_was_tilde) {
596 switch (o.code_point) {
597 case '0':
598 case '1':
599 break;
600 case 'n':
601 case 'r':
602 if (strict_json_pointer_syntax) {
603 return false;
604 }
605 break;
606 default:
607 return false;
608 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100609 }
610 previous_was_tilde = o.code_point == '~';
Nigel Taod6fdfb12020-03-11 12:24:14 +1100611
Nigel Tao0cd2f982020-03-03 23:03:02 +1100612 s.ptr += o.byte_length;
613 s.len -= o.byte_length;
614 }
615 return !previous_was_tilde;
616 }
Nigel Taod60815c2020-03-26 14:32:35 +1100617} g_query;
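
// The '~' escape rule that Query::validate enforces can be sketched without
// Wuffs. This is a minimal sketch, not jsonptr's implementation: the helper
// name is hypothetical, and it skips the UTF-8 validity check that the real
// function performs via wuffs_base__utf_8__next.

```cpp
#include <cassert>
#include <string>

// is_valid_json_pointer_sketch is a hypothetical helper, not part of jsonptr.
// It checks only the leading '/' and the '~' escape rules. When strict is
// false, "~n" and "~r" are accepted in addition to "~0" and "~1".
bool is_valid_json_pointer_sketch(const std::string& s, bool strict) {
  if (s.empty()) {
    return true;
  }
  if (s[0] != '/') {
    return false;
  }
  bool previous_was_tilde = false;
  for (char c : s) {
    if (previous_was_tilde) {
      bool ok = (c == '0') || (c == '1') ||
                (!strict && ((c == 'n') || (c == 'r')));
      if (!ok) {
        return false;
      }
    }
    previous_was_tilde = (c == '~');
  }
  // A trailing '~' has nothing after it, so it is invalid.
  return !previous_was_tilde;
}
```

// For example, "/a~1b" is accepted in both modes, while "/a~n" is accepted
// only when strict is false.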
Nigel Tao0cd2f982020-03-03 23:03:02 +1100618
619// ----
620
Nigel Tao168f60a2020-07-14 13:19:33 +1000621enum class file_format {
622 json,
623 cbor,
624};
625
Nigel Tao68920952020-03-03 11:25:18 +1100626struct {
627 int remaining_argc;
628 char** remaining_argv;
629
Nigel Tao3690e832020-03-12 16:52:26 +1100630 bool compact_output;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100631 bool fail_if_unsandboxed;
Nigel Tao4e193592020-07-15 12:48:57 +1000632 file_format input_format;
Nigel Tao3c8589b2020-07-19 21:49:00 +1000633 bool input_allow_json_comments;
634 bool input_allow_json_extra_comma;
Nigel Tao51a38292020-07-19 22:43:17 +1000635 bool input_allow_json_inf_nan_numbers;
Nigel Tao52c4d6a2020-03-08 21:12:38 +1100636 uint32_t max_output_depth;
Nigel Tao168f60a2020-07-14 13:19:33 +1000637 file_format output_format;
Nigel Tao3c8589b2020-07-19 21:49:00 +1000638 bool output_cbor_metadata_as_json_comments;
Nigel Taoc766bb72020-07-09 12:59:32 +1000639 bool output_json_extra_comma;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100640 char* query_c_string;
Nigel Taoecadf722020-07-13 08:22:34 +1000641 size_t spaces;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100642 bool strict_json_pointer_syntax;
Nigel Tao68920952020-03-03 11:25:18 +1100643 bool tabs;
Nigel Taod60815c2020-03-26 14:32:35 +1100644} g_flags = {0};
Nigel Tao68920952020-03-03 11:25:18 +1100645
646const char* //
647parse_flags(int argc, char** argv) {
Nigel Taoecadf722020-07-13 08:22:34 +1000648 g_flags.spaces = 4;
Nigel Taod60815c2020-03-26 14:32:35 +1100649 g_flags.max_output_depth = 0xFFFFFFFF;
Nigel Tao68920952020-03-03 11:25:18 +1100650
651 int c = (argc > 0) ? 1 : 0; // Skip argv[0], the program name.
652 for (; c < argc; c++) {
653 char* arg = argv[c];
654 if (*arg++ != '-') {
655 break;
656 }
657
658 // A double-dash "--foo" is equivalent to a single-dash "-foo". As special
659 // cases, a bare "-" is not a flag (some programs may interpret it as
660 // stdin) and a bare "--" means to stop parsing flags.
661 if (*arg == '\x00') {
662 break;
663 } else if (*arg == '-') {
664 arg++;
665 if (*arg == '\x00') {
666 c++;
667 break;
668 }
669 }
670
Nigel Tao3690e832020-03-12 16:52:26 +1100671 if (!strcmp(arg, "c") || !strcmp(arg, "compact-output")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100672 g_flags.compact_output = true;
Nigel Tao68920952020-03-03 11:25:18 +1100673 continue;
674 }
Nigel Tao94440cf2020-04-02 22:28:24 +1100675 if (!strcmp(arg, "d") || !strcmp(arg, "max-output-depth")) {
676 g_flags.max_output_depth = 1;
677 continue;
678 } else if (!strncmp(arg, "d=", 2) ||
               !strncmp(arg, "max-output-depth=", 17)) {
680 while (*arg++ != '=') {
681 }
682 wuffs_base__result_u64 u = wuffs_base__parse_number_u64(
Nigel Tao6b7ce302020-07-07 16:19:46 +1000683 wuffs_base__make_slice_u8((uint8_t*)arg, strlen(arg)),
684 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
Nigel Taoaf757722020-07-18 17:27:11 +1000685 if (u.status.is_ok() && (u.value <= 0xFFFFFFFF)) {
Nigel Tao94440cf2020-04-02 22:28:24 +1100686 g_flags.max_output_depth = (uint32_t)(u.value);
687 continue;
688 }
689 return g_usage;
690 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100691 if (!strcmp(arg, "fail-if-unsandboxed")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100692 g_flags.fail_if_unsandboxed = true;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100693 continue;
694 }
Nigel Tao4e193592020-07-15 12:48:57 +1000695 if (!strcmp(arg, "i=cbor") || !strcmp(arg, "input-format=cbor")) {
696 g_flags.input_format = file_format::cbor;
697 continue;
698 }
699 if (!strcmp(arg, "i=json") || !strcmp(arg, "input-format=json")) {
700 g_flags.input_format = file_format::json;
701 continue;
702 }
Nigel Tao3c8589b2020-07-19 21:49:00 +1000703 if (!strcmp(arg, "input-allow-json-comments")) {
704 g_flags.input_allow_json_comments = true;
705 continue;
706 }
707 if (!strcmp(arg, "input-allow-json-extra-comma")) {
708 g_flags.input_allow_json_extra_comma = true;
Nigel Taoc766bb72020-07-09 12:59:32 +1000709 continue;
710 }
Nigel Tao51a38292020-07-19 22:43:17 +1000711 if (!strcmp(arg, "input-allow-json-inf-nan-numbers")) {
712 g_flags.input_allow_json_inf_nan_numbers = true;
713 continue;
714 }
Nigel Tao168f60a2020-07-14 13:19:33 +1000715 if (!strcmp(arg, "o=cbor") || !strcmp(arg, "output-format=cbor")) {
716 g_flags.output_format = file_format::cbor;
717 continue;
718 }
719 if (!strcmp(arg, "o=json") || !strcmp(arg, "output-format=json")) {
720 g_flags.output_format = file_format::json;
721 continue;
722 }
Nigel Tao3c8589b2020-07-19 21:49:00 +1000723 if (!strcmp(arg, "output-cbor-metadata-as-json-comments")) {
724 g_flags.output_cbor_metadata_as_json_comments = true;
725 continue;
726 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000727 if (!strcmp(arg, "output-json-extra-comma")) {
728 g_flags.output_json_extra_comma = true;
729 continue;
730 }
Nigel Tao0cd2f982020-03-03 23:03:02 +1100731 if (!strncmp(arg, "q=", 2) || !strncmp(arg, "query=", 6)) {
732 while (*arg++ != '=') {
733 }
Nigel Taod60815c2020-03-26 14:32:35 +1100734 g_flags.query_c_string = arg;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100735 continue;
736 }
Nigel Taoecadf722020-07-13 08:22:34 +1000737 if (!strncmp(arg, "s=", 2) || !strncmp(arg, "spaces=", 7)) {
738 while (*arg++ != '=') {
739 }
740 if (('0' <= arg[0]) && (arg[0] <= '8') && (arg[1] == '\x00')) {
741 g_flags.spaces = arg[0] - '0';
742 continue;
743 }
744 return g_usage;
745 }
746 if (!strcmp(arg, "strict-json-pointer-syntax")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100747 g_flags.strict_json_pointer_syntax = true;
Nigel Taod6fdfb12020-03-11 12:24:14 +1100748 continue;
Nigel Tao68920952020-03-03 11:25:18 +1100749 }
750 if (!strcmp(arg, "t") || !strcmp(arg, "tabs")) {
Nigel Taod60815c2020-03-26 14:32:35 +1100751 g_flags.tabs = true;
Nigel Tao68920952020-03-03 11:25:18 +1100752 continue;
753 }
754
Nigel Taod60815c2020-03-26 14:32:35 +1100755 return g_usage;
Nigel Tao68920952020-03-03 11:25:18 +1100756 }
757
Nigel Taod60815c2020-03-26 14:32:35 +1100758 if (g_flags.query_c_string &&
759 !Query::validate(g_flags.query_c_string, strlen(g_flags.query_c_string),
760 g_flags.strict_json_pointer_syntax)) {
Nigel Taod6fdfb12020-03-11 12:24:14 +1100761 return "main: bad JSON Pointer (RFC 6901) syntax for the -query=STR flag";
762 }
763
Nigel Taod60815c2020-03-26 14:32:35 +1100764 g_flags.remaining_argc = argc - c;
765 g_flags.remaining_argv = argv + c;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100766 return nullptr;
Nigel Tao68920952020-03-03 11:25:18 +1100767}
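
// The dash handling at the top of parse_flags's loop can be isolated as a
// small function. This is a sketch under assumptions: the helper name is
// hypothetical, and it collapses the bare "-" and bare "--" cases into one
// "not a flag" result, whereas parse_flags also consumes a bare "--".

```cpp
#include <cassert>
#include <cstring>

// strip_flag_dashes_sketch is a hypothetical helper, not part of jsonptr. It
// returns the flag name after one or two leading dashes, or nullptr if arg
// has no leading dash, is a bare "-" or is a bare "--".
const char* strip_flag_dashes_sketch(const char* arg) {
  assert(arg != nullptr);
  if (*arg++ != '-') {
    return nullptr;
  }
  if (*arg == '\x00') {
    return nullptr;  // A bare "-" is not a flag.
  }
  if (*arg == '-') {
    arg++;
    if (*arg == '\x00') {
      return nullptr;  // A bare "--" means to stop parsing flags.
    }
  }
  return arg;
}
```

// With this, "-tabs" and "--tabs" both yield "tabs", matching the comment
// above that a double-dash is equivalent to a single-dash.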
768
Nigel Tao2cf76db2020-02-27 22:42:01 +1100769const char* //
770initialize_globals(int argc, char** argv) {
Nigel Taod60815c2020-03-26 14:32:35 +1100771 g_dst = wuffs_base__make_io_buffer(
772 wuffs_base__make_slice_u8(g_dst_array, DST_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100773 wuffs_base__empty_io_buffer_meta());
Nigel Tao1b073492020-02-16 22:11:36 +1100774
Nigel Taod60815c2020-03-26 14:32:35 +1100775 g_src = wuffs_base__make_io_buffer(
776 wuffs_base__make_slice_u8(g_src_array, SRC_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100777 wuffs_base__empty_io_buffer_meta());
778
Nigel Taod60815c2020-03-26 14:32:35 +1100779 g_tok = wuffs_base__make_token_buffer(
780 wuffs_base__make_slice_token(g_tok_array, TOKEN_BUFFER_ARRAY_SIZE),
Nigel Tao2cf76db2020-02-27 22:42:01 +1100781 wuffs_base__empty_token_buffer_meta());
782
Nigel Taod60815c2020-03-26 14:32:35 +1100783 g_curr_token_end_src_index = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100784
Nigel Tao850dc182020-07-21 22:52:04 +1000785 g_token_extension.category = 0;
786 g_token_extension.detail = 0;
787
Nigel Taod60815c2020-03-26 14:32:35 +1100788 g_depth = 0;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100789
Nigel Taod60815c2020-03-26 14:32:35 +1100790 g_ctx = context::none;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100791
Nigel Tao68920952020-03-03 11:25:18 +1100792 TRY(parse_flags(argc, argv));
Nigel Taod60815c2020-03-26 14:32:35 +1100793 if (g_flags.fail_if_unsandboxed && !g_sandboxed) {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100794 return "main: unsandboxed";
795 }
Nigel Tao01abc842020-03-06 21:42:33 +1100796 const int stdin_fd = 0;
Nigel Taod60815c2020-03-26 14:32:35 +1100797 if (g_flags.remaining_argc >
798 ((g_input_file_descriptor != stdin_fd) ? 1 : 0)) {
799 return g_usage;
Nigel Tao107f0ef2020-03-01 21:35:02 +1100800 }
801
Nigel Taod60815c2020-03-26 14:32:35 +1100802 g_query.reset(g_flags.query_c_string);
Nigel Tao0cd2f982020-03-03 23:03:02 +1100803
  // If the query is non-empty, suppress writing to stdout until we've
  // completed the query.
Nigel Taod60815c2020-03-26 14:32:35 +1100806 g_suppress_write_dst = g_query.next_fragment() ? 1 : 0;
807 g_wrote_to_dst = false;
Nigel Tao0cd2f982020-03-03 23:03:02 +1100808
Nigel Tao4e193592020-07-15 12:48:57 +1000809 if (g_flags.input_format == file_format::json) {
810 TRY(g_json_decoder
811 .initialize(sizeof__wuffs_json__decoder(), WUFFS_VERSION, 0)
812 .message());
813 g_dec = g_json_decoder.upcast_as__wuffs_base__token_decoder();
814 } else {
815 TRY(g_cbor_decoder
816 .initialize(sizeof__wuffs_cbor__decoder(), WUFFS_VERSION, 0)
817 .message());
818 g_dec = g_cbor_decoder.upcast_as__wuffs_base__token_decoder();
819 }
Nigel Tao4b186b02020-03-18 14:25:21 +1100820
Nigel Tao3c8589b2020-07-19 21:49:00 +1000821 if (g_flags.input_allow_json_comments) {
822 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_BLOCK, true);
823 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_COMMENT_LINE, true);
824 }
825 if (g_flags.input_allow_json_extra_comma) {
Nigel Tao4e193592020-07-15 12:48:57 +1000826 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_EXTRA_COMMA, true);
Nigel Taoc766bb72020-07-09 12:59:32 +1000827 }
Nigel Tao51a38292020-07-19 22:43:17 +1000828 if (g_flags.input_allow_json_inf_nan_numbers) {
829 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_INF_NAN_NUMBERS, true);
830 }
Nigel Taoc766bb72020-07-09 12:59:32 +1000831
Nigel Tao4b186b02020-03-18 14:25:21 +1100832 // Consume an optional whitespace trailer. This isn't part of the JSON spec,
  // but it works better with line-oriented Unix tools (such as "echo 123 |
834 // jsonptr" where it's "echo", not "echo -n") or hand-edited JSON files which
835 // can accidentally contain trailing whitespace.
Nigel Tao4e193592020-07-15 12:48:57 +1000836 g_dec->set_quirk_enabled(WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE, true);
Nigel Tao4b186b02020-03-18 14:25:21 +1100837
838 return nullptr;
Nigel Tao2cf76db2020-02-27 22:42:01 +1100839}
Nigel Tao1b073492020-02-16 22:11:36 +1100840
841// ----
842
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100843// ignore_return_value suppresses errors from -Wall -Werror.
844static void //
845ignore_return_value(int ignored) {}
846
Nigel Tao2914bae2020-02-26 09:40:30 +1100847const char* //
848read_src() {
Nigel Taod60815c2020-03-26 14:32:35 +1100849 if (g_src.meta.closed) {
Nigel Tao9cc2c252020-02-23 17:05:49 +1100850 return "main: internal error: read requested on a closed source";
Nigel Taoa8406922020-02-19 12:22:00 +1100851 }
Nigel Taod60815c2020-03-26 14:32:35 +1100852 g_src.compact();
853 if (g_src.meta.wi >= g_src.data.len) {
854 return "main: g_src buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100855 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100856 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100857 ssize_t n = read(g_input_file_descriptor, g_src.data.ptr + g_src.meta.wi,
858 g_src.data.len - g_src.meta.wi);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100859 if (n >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100860 g_src.meta.wi += n;
861 g_src.meta.closed = n == 0;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100862 break;
863 } else if (errno != EINTR) {
864 return strerror(errno);
865 }
Nigel Tao1b073492020-02-16 22:11:36 +1100866 }
867 return nullptr;
868}
869
Nigel Tao2914bae2020-02-26 09:40:30 +1100870const char* //
871flush_dst() {
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100872 while (true) {
Nigel Taod60815c2020-03-26 14:32:35 +1100873 size_t n = g_dst.meta.wi - g_dst.meta.ri;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100874 if (n == 0) {
875 break;
Nigel Tao1b073492020-02-16 22:11:36 +1100876 }
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100877 const int stdout_fd = 1;
Nigel Taod60815c2020-03-26 14:32:35 +1100878 ssize_t i = write(stdout_fd, g_dst.data.ptr + g_dst.meta.ri, n);
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100879 if (i >= 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100880 g_dst.meta.ri += i;
Nigel Taofe0cbbd2020-03-05 22:01:30 +1100881 } else if (errno != EINTR) {
882 return strerror(errno);
883 }
Nigel Tao1b073492020-02-16 22:11:36 +1100884 }
Nigel Taod60815c2020-03-26 14:32:35 +1100885 g_dst.compact();
Nigel Tao1b073492020-02-16 22:11:36 +1100886 return nullptr;
887}
888
Nigel Tao2914bae2020-02-26 09:40:30 +1100889const char* //
890write_dst(const void* s, size_t n) {
Nigel Taod60815c2020-03-26 14:32:35 +1100891 if (g_suppress_write_dst > 0) {
Nigel Tao0cd2f982020-03-03 23:03:02 +1100892 return nullptr;
893 }
Nigel Tao1b073492020-02-16 22:11:36 +1100894 const uint8_t* p = static_cast<const uint8_t*>(s);
895 while (n > 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100896 size_t i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100897 if (i == 0) {
898 const char* z = flush_dst();
899 if (z) {
900 return z;
901 }
Nigel Taod60815c2020-03-26 14:32:35 +1100902 i = g_dst.writer_available();
Nigel Tao1b073492020-02-16 22:11:36 +1100903 if (i == 0) {
Nigel Taod60815c2020-03-26 14:32:35 +1100904 return "main: g_dst buffer is full";
Nigel Tao1b073492020-02-16 22:11:36 +1100905 }
906 }
907
908 if (i > n) {
909 i = n;
910 }
Nigel Taod60815c2020-03-26 14:32:35 +1100911 memcpy(g_dst.data.ptr + g_dst.meta.wi, p, i);
912 g_dst.meta.wi += i;
Nigel Tao1b073492020-02-16 22:11:36 +1100913 p += i;
914 n -= i;
Nigel Taod60815c2020-03-26 14:32:35 +1100915 g_wrote_to_dst = true;
Nigel Tao1b073492020-02-16 22:11:36 +1100916 }
917 return nullptr;
918}
919
920// ----
921
Nigel Tao168f60a2020-07-14 13:19:33 +1000922const char* //
923write_literal(uint64_t vbd) {
924 const char* ptr = nullptr;
925 size_t len = 0;
926 if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__UNDEFINED) {
927 if (g_flags.output_format == file_format::json) {
Nigel Tao3c8589b2020-07-19 21:49:00 +1000928 // JSON's closest approximation to "undefined" is "null".
929 if (g_flags.output_cbor_metadata_as_json_comments) {
930 ptr = "/*cbor:undefined*/null";
931 len = 22;
932 } else {
933 ptr = "null";
934 len = 4;
935 }
Nigel Tao168f60a2020-07-14 13:19:33 +1000936 } else {
937 ptr = "\xF7";
938 len = 1;
939 }
940 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__NULL) {
941 if (g_flags.output_format == file_format::json) {
942 ptr = "null";
943 len = 4;
944 } else {
945 ptr = "\xF6";
946 len = 1;
947 }
948 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__FALSE) {
949 if (g_flags.output_format == file_format::json) {
950 ptr = "false";
951 len = 5;
952 } else {
953 ptr = "\xF4";
954 len = 1;
955 }
956 } else if (vbd & WUFFS_BASE__TOKEN__VBD__LITERAL__TRUE) {
957 if (g_flags.output_format == file_format::json) {
958 ptr = "true";
959 len = 4;
960 } else {
961 ptr = "\xF5";
962 len = 1;
963 }
964 } else {
965 return "main: internal error: unexpected write_literal argument";
966 }
967 return write_dst(ptr, len);
968}
969
970// ----
971
972const char* //
Nigel Tao664f8432020-07-16 21:25:14 +1000973write_number_as_cbor_f64(double f) {
Nigel Tao168f60a2020-07-14 13:19:33 +1000974 uint8_t buf[9];
975 wuffs_base__lossy_value_u16 lv16 =
976 wuffs_base__ieee_754_bit_representation__from_f64_to_u16_truncate(f);
977 if (!lv16.lossy) {
978 buf[0] = 0xF9;
979 wuffs_base__store_u16be__no_bounds_check(&buf[1], lv16.value);
980 return write_dst(&buf[0], 3);
981 }
982 wuffs_base__lossy_value_u32 lv32 =
983 wuffs_base__ieee_754_bit_representation__from_f64_to_u32_truncate(f);
984 if (!lv32.lossy) {
985 buf[0] = 0xFA;
986 wuffs_base__store_u32be__no_bounds_check(&buf[1], lv32.value);
987 return write_dst(&buf[0], 5);
988 }
989 buf[0] = 0xFB;
990 wuffs_base__store_u64be__no_bounds_check(
991 &buf[1], wuffs_base__ieee_754_bit_representation__from_f64_to_u64(f));
992 return write_dst(&buf[0], 9);
993}
994
995const char* //
Nigel Tao664f8432020-07-16 21:25:14 +1000996write_number_as_cbor_u64(uint8_t base, uint64_t u) {
Nigel Tao168f60a2020-07-14 13:19:33 +1000997 uint8_t buf[9];
998 if (u < 0x18) {
999 buf[0] = base | ((uint8_t)u);
1000 return write_dst(&buf[0], 1);
1001 } else if ((u >> 8) == 0) {
1002 buf[0] = base | 0x18;
1003 buf[1] = ((uint8_t)u);
1004 return write_dst(&buf[0], 2);
1005 } else if ((u >> 16) == 0) {
1006 buf[0] = base | 0x19;
1007 wuffs_base__store_u16be__no_bounds_check(&buf[1], ((uint16_t)u));
1008 return write_dst(&buf[0], 3);
1009 } else if ((u >> 32) == 0) {
1010 buf[0] = base | 0x1A;
1011 wuffs_base__store_u32be__no_bounds_check(&buf[1], ((uint32_t)u));
1012 return write_dst(&buf[0], 5);
1013 }
1014 buf[0] = base | 0x1B;
1015 wuffs_base__store_u64be__no_bounds_check(&buf[1], u);
1016 return write_dst(&buf[0], 9);
1017}
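
// The size ladder in write_number_as_cbor_u64 follows CBOR's head encoding:
// values under 0x18 fit in the initial byte, and larger values take 1, 2, 4
// or 8 big-endian bytes after it. A minimal sketch that returns the bytes
// instead of calling write_dst (the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// encode_cbor_head_sketch is a hypothetical helper, not part of jsonptr. The
// base argument holds the major type in its high 3 bits (0x00 for unsigned
// integers, 0x20 for negative integers, as write_number passes).
std::vector<uint8_t> encode_cbor_head_sketch(uint8_t base, uint64_t u) {
  assert((base & 0x1F) == 0);
  std::vector<uint8_t> out;
  int num_extra_bytes = 8;
  if (u < 0x18) {
    out.push_back(static_cast<uint8_t>(base | u));
    return out;
  } else if ((u >> 8) == 0) {
    out.push_back(static_cast<uint8_t>(base | 0x18));
    num_extra_bytes = 1;
  } else if ((u >> 16) == 0) {
    out.push_back(static_cast<uint8_t>(base | 0x19));
    num_extra_bytes = 2;
  } else if ((u >> 32) == 0) {
    out.push_back(static_cast<uint8_t>(base | 0x1A));
    num_extra_bytes = 4;
  } else {
    out.push_back(static_cast<uint8_t>(base | 0x1B));
  }
  // Append the value in big-endian order.
  for (int i = num_extra_bytes - 1; i >= 0; i--) {
    out.push_back(static_cast<uint8_t>(u >> (8 * i)));
  }
  return out;
}
```

// For example, 1000 encodes as 0x19 0x03 0xE8, and -100 (base 0x20, u = 99)
// encodes as 0x38 0x63.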
1018
1019const char* //
Nigel Tao850dc182020-07-21 22:52:04 +10001020write_cbor_minus_1_minus_x(uint8_t* ptr, size_t len) {
1021 if (len != 9) {
1022 return "main: internal error: invalid ETC__MINUS_1_MINUS_X token length";
Nigel Tao664f8432020-07-16 21:25:14 +10001023 }
Nigel Tao850dc182020-07-21 22:52:04 +10001024 uint64_t u = 1 + wuffs_base__load_u64be__no_bounds_check(ptr + 1);
1025 if (u == 0) {
1026 // See the cbor.TOKEN_VALUE_MINOR__MINUS_1_MINUS_X comment re overflow.
1027 return write_dst("-18446744073709551616", 21);
Nigel Tao664f8432020-07-16 21:25:14 +10001028 }
1029 uint8_t buf[1 + WUFFS_BASE__U64__BYTE_LENGTH__MAX_INCL];
1030 uint8_t* b = &buf[0];
Nigel Tao850dc182020-07-21 22:52:04 +10001031 *b++ = '-';
Nigel Tao664f8432020-07-16 21:25:14 +10001032 size_t n = wuffs_base__render_number_u64(
1033 wuffs_base__make_slice_u8(b, WUFFS_BASE__U64__BYTE_LENGTH__MAX_INCL), u,
1034 WUFFS_BASE__RENDER_NUMBER_XXX__DEFAULT_OPTIONS);
Nigel Tao850dc182020-07-21 22:52:04 +10001035 return write_dst(&buf[0], 1 + n);
Nigel Tao664f8432020-07-16 21:25:14 +10001036}
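
// The arithmetic in write_cbor_minus_1_minus_x is that a CBOR negative
// integer with argument x denotes (-1 - x), whose magnitude (x + 1) wraps to
// zero in uint64_t when x is 0xFFFFFFFFFFFFFFFF. A sketch using std::string
// rather than jsonptr's fixed buffers (the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// render_minus_1_minus_x_sketch is a hypothetical helper, not part of
// jsonptr. It returns the decimal text of (-1 - x).
std::string render_minus_1_minus_x_sketch(uint64_t x) {
  uint64_t u = 1 + x;  // Unsigned arithmetic wraps around on overflow.
  if (u == 0) {
    // The magnitude is 2^64, which does not fit in a uint64_t.
    return "-18446744073709551616";
  }
  return "-" + std::to_string(u);
}
```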
1037
1038const char* //
Nigel Tao168f60a2020-07-14 13:19:33 +10001039write_number(uint64_t vbd, uint8_t* ptr, size_t len) {
Nigel Tao4e193592020-07-15 12:48:57 +10001040 if (g_flags.output_format == file_format::json) {
Nigel Tao51a38292020-07-19 22:43:17 +10001041 if (g_flags.input_format == file_format::json) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001042 return write_dst(ptr, len);
1043 }
1044
    // The branches below handle (g_flags.output_format == file_format::cbor).
Nigel Tao4e193592020-07-15 12:48:57 +10001046 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__FORMAT_TEXT) {
Nigel Tao168f60a2020-07-14 13:19:33 +10001047 // First try to parse (ptr, len) as an integer. Something like
1048 // "1180591620717411303424" is a valid number (in the JSON sense) but will
1049 // overflow int64_t or uint64_t, so fall back to parsing it as a float64.
1050 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_INTEGER_SIGNED) {
1051 if ((len > 0) && (ptr[0] == '-')) {
1052 wuffs_base__result_i64 ri = wuffs_base__parse_number_i64(
1053 wuffs_base__make_slice_u8(ptr, len),
1054 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1055 if (ri.status.is_ok()) {
Nigel Tao664f8432020-07-16 21:25:14 +10001056 return write_number_as_cbor_u64(0x20, ~ri.value);
Nigel Tao168f60a2020-07-14 13:19:33 +10001057 }
1058 } else {
1059 wuffs_base__result_u64 ru = wuffs_base__parse_number_u64(
1060 wuffs_base__make_slice_u8(ptr, len),
1061 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1062 if (ru.status.is_ok()) {
Nigel Tao664f8432020-07-16 21:25:14 +10001063 return write_number_as_cbor_u64(0x00, ru.value);
Nigel Tao168f60a2020-07-14 13:19:33 +10001064 }
1065 }
1066 }
1067
1068 if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_FLOATING_POINT) {
1069 wuffs_base__result_f64 rf = wuffs_base__parse_number_f64(
1070 wuffs_base__make_slice_u8(ptr, len),
1071 WUFFS_BASE__PARSE_NUMBER_XXX__DEFAULT_OPTIONS);
1072 if (rf.status.is_ok()) {
Nigel Tao664f8432020-07-16 21:25:14 +10001073 return write_number_as_cbor_f64(rf.value);
Nigel Tao168f60a2020-07-14 13:19:33 +10001074 }
1075 }
Nigel Tao51a38292020-07-19 22:43:17 +10001076 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_NEG_INF) {
1077 return write_dst("\xF9\xFC\x00", 3);
1078 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_POS_INF) {
1079 return write_dst("\xF9\x7C\x00", 3);
1080 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_NEG_NAN) {
1081 return write_dst("\xF9\xFF\xFF", 3);
1082 } else if (vbd & WUFFS_BASE__TOKEN__VBD__NUMBER__CONTENT_POS_NAN) {
1083 return write_dst("\xF9\x7F\xFF", 3);
Nigel Tao168f60a2020-07-14 13:19:33 +10001084 }
1085
Nigel Tao4e193592020-07-15 12:48:57 +10001086fail:
Nigel Tao168f60a2020-07-14 13:19:33 +10001087 return "main: internal error: unexpected write_number argument";
1088}
1089
Nigel Tao4e193592020-07-15 12:48:57 +10001090const char* //
Nigel Taoc9d4e342020-07-21 15:20:34 +10001091write_inline_integer(uint64_t x, bool x_is_signed, uint8_t* ptr, size_t len) {
Nigel Tao4e193592020-07-15 12:48:57 +10001092 if (g_flags.output_format == file_format::cbor) {
1093 return write_dst(ptr, len);
1094 }
1095
Nigel Taoc9d4e342020-07-21 15:20:34 +10001096 // Adding the two ETC__BYTE_LENGTH__ETC constants is overkill, but it's
1097 // simpler (for producing a constant-expression array size) than taking the
1098 // maximum of the two.
1099 uint8_t buf[WUFFS_BASE__I64__BYTE_LENGTH__MAX_INCL +
1100 WUFFS_BASE__U64__BYTE_LENGTH__MAX_INCL];
1101 wuffs_base__slice_u8 dst = wuffs_base__make_slice_u8(&buf[0], sizeof buf);
1102 size_t n =
1103 x_is_signed
1104 ? wuffs_base__render_number_i64(
1105 dst, (int64_t)x, WUFFS_BASE__RENDER_NUMBER_XXX__DEFAULT_OPTIONS)
1106 : wuffs_base__render_number_u64(
1107 dst, x, WUFFS_BASE__RENDER_NUMBER_XXX__DEFAULT_OPTIONS);
Nigel Tao4e193592020-07-15 12:48:57 +10001108 return write_dst(&buf[0], n);
1109}
1110
Nigel Tao168f60a2020-07-14 13:19:33 +10001111// ----
1112
Nigel Tao2914bae2020-02-26 09:40:30 +11001113uint8_t //
1114hex_digit(uint8_t nibble) {
Nigel Taob5461bd2020-02-21 14:13:37 +11001115 nibble &= 0x0F;
1116 if (nibble <= 9) {
1117 return '0' + nibble;
1118 }
1119 return ('A' - 10) + nibble;
1120}
1121
Nigel Tao2914bae2020-02-26 09:40:30 +11001122const char* //
Nigel Tao168f60a2020-07-14 13:19:33 +10001123flush_cbor_output_string() {
1124 uint8_t prefix[3];
1125 prefix[0] = g_cbor_output_string_is_utf_8 ? 0x60 : 0x40;
1126 if (g_cbor_output_string_length < 0x18) {
1127 prefix[0] |= g_cbor_output_string_length;
1128 TRY(write_dst(&prefix[0], 1));
1129 } else if (g_cbor_output_string_length <= 0xFF) {
1130 prefix[0] |= 0x18;
1131 prefix[1] = g_cbor_output_string_length;
1132 TRY(write_dst(&prefix[0], 2));
1133 } else if (g_cbor_output_string_length <= 0xFFFF) {
1134 prefix[0] |= 0x19;
1135 prefix[1] = g_cbor_output_string_length >> 8;
1136 prefix[2] = g_cbor_output_string_length;
1137 TRY(write_dst(&prefix[0], 3));
1138 } else {
1139 return "main: internal error: CBOR string output is too long";
1140 }
1141
1142 size_t n = g_cbor_output_string_length;
1143 g_cbor_output_string_length = 0;
1144 return write_dst(&g_cbor_output_string_array[0], n);
1145}
1146
1147const char* //
1148write_cbor_output_string(uint8_t* ptr, size_t len, bool finish) {
1149 // Check that g_cbor_output_string_array can hold any UTF-8 code point.
1150 if (CBOR_OUTPUT_STRING_ARRAY_SIZE < 4) {
1151 return "main: internal error: CBOR_OUTPUT_STRING_ARRAY_SIZE is too short";
1152 }
1153
1154 while (len > 0) {
1155 size_t available =
1156 CBOR_OUTPUT_STRING_ARRAY_SIZE - g_cbor_output_string_length;
1157 if (available >= len) {
1158 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1159 len);
1160 g_cbor_output_string_length += len;
1161 ptr += len;
1162 len = 0;
1163 break;
1164
1165 } else if (available > 0) {
1166 if (!g_cbor_output_string_is_multiple_chunks) {
1167 g_cbor_output_string_is_multiple_chunks = true;
1168 TRY(write_dst(g_cbor_output_string_is_utf_8 ? "\x7F" : "\x5F", 1));
Nigel Tao3b486982020-02-27 15:05:59 +11001169 }
Nigel Tao168f60a2020-07-14 13:19:33 +10001170
1171 if (g_cbor_output_string_is_utf_8) {
1172 // Walk the end backwards to a UTF-8 boundary, so that each chunk of
1173 // the multi-chunk string is also valid UTF-8.
1174 while (available > 0) {
Nigel Tao702c7b22020-07-22 15:42:54 +10001175 wuffs_base__utf_8__next__output o =
1176 wuffs_base__utf_8__next_from_end(ptr, available);
Nigel Tao168f60a2020-07-14 13:19:33 +10001177 if ((o.code_point != WUFFS_BASE__UNICODE_REPLACEMENT_CHARACTER) ||
1178 (o.byte_length != 1)) {
1179 break;
1180 }
1181 available--;
1182 }
1183 }
1184
1185 memcpy(&g_cbor_output_string_array[g_cbor_output_string_length], ptr,
1186 available);
1187 g_cbor_output_string_length += available;
1188 ptr += available;
1189 len -= available;
Nigel Tao3b486982020-02-27 15:05:59 +11001190 }
1191
Nigel Tao168f60a2020-07-14 13:19:33 +10001192 TRY(flush_cbor_output_string());
1193 }
Nigel Taob9ad34f2020-03-03 12:44:01 +11001194
Nigel Tao168f60a2020-07-14 13:19:33 +10001195 if (finish) {
1196 TRY(flush_cbor_output_string());
1197 if (g_cbor_output_string_is_multiple_chunks) {
1198 TRY(write_dst("\xFF", 1));
1199 }
1200 }
1201 return nullptr;
1202}
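
// The "walk the end backwards to a UTF-8 boundary" step can also be done by
// skipping continuation bytes (those of the form 0b10xxxxxx). This sketch
// swaps in that technique instead of wuffs_base__utf_8__next_from_end, and it
// assumes the input is already valid UTF-8. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// utf_8_chunk_length_sketch is a hypothetical helper, not part of jsonptr. It
// returns the largest n <= max_len such that splitting (ptr, len) after n
// bytes does not cut a code point in half.
size_t utf_8_chunk_length_sketch(const uint8_t* ptr,
                                 size_t len,
                                 size_t max_len) {
  assert(ptr != nullptr);
  if (len <= max_len) {
    return len;
  }
  size_t n = max_len;
  // ptr[n] is the first byte of the next chunk. It must not be a
  // continuation byte. Reading ptr[n] is in bounds because n < len.
  while ((n > 0) && ((ptr[n] & 0xC0) == 0x80)) {
    n--;
  }
  return n;
}
```

// The result can be zero when max_len is smaller than one code point, which
// is why the caller needs a buffer of at least 4 bytes to make progress.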
Nigel Taob9ad34f2020-03-03 12:44:01 +11001203
Nigel Tao168f60a2020-07-14 13:19:33 +10001204const char* //
Nigel Tao7cb76542020-07-19 22:19:04 +10001205handle_unicode_code_point(uint32_t ucp) {
1206 if (g_flags.output_format == file_format::json) {
1207 if (ucp < 0x0020) {
1208 switch (ucp) {
1209 case '\b':
1210 return write_dst("\\b", 2);
1211 case '\f':
1212 return write_dst("\\f", 2);
1213 case '\n':
1214 return write_dst("\\n", 2);
1215 case '\r':
1216 return write_dst("\\r", 2);
1217 case '\t':
1218 return write_dst("\\t", 2);
1219 }
1220
1221 // Other bytes less than 0x0020 are valid UTF-8 but not valid in a
1222 // JSON string. They need to remain escaped.
1223 uint8_t esc6[6];
1224 esc6[0] = '\\';
1225 esc6[1] = 'u';
1226 esc6[2] = '0';
1227 esc6[3] = '0';
1228 esc6[4] = hex_digit(ucp >> 4);
1229 esc6[5] = hex_digit(ucp >> 0);
1230 return write_dst(&esc6[0], 6);
1231
1232 } else if (ucp == '\"') {
1233 return write_dst("\\\"", 2);
1234
1235 } else if (ucp == '\\') {
1236 return write_dst("\\\\", 2);
1237 }
1238 }
1239
1240 uint8_t u[WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL];
1241 size_t n = wuffs_base__utf_8__encode(
1242 wuffs_base__make_slice_u8(&u[0],
1243 WUFFS_BASE__UTF_8__BYTE_LENGTH__MAX_INCL),
1244 ucp);
1245 if (n == 0) {
1246 return "main: internal error: unexpected Unicode code point";
1247 }
1248
1249 if (g_flags.output_format == file_format::json) {
1250 return write_dst(&u[0], n);
1251 }
1252 return write_cbor_output_string(&u[0], n, false);
1253}
Nigel Taod191a3f2020-07-19 22:14:54 +10001254
1255const char* //
1256write_json_escaped_string(uint8_t* ptr, size_t len) {
1257restart:
1258 while (true) {
1259 size_t i;
1260 for (i = 0; i < len; i++) {
1261 uint8_t c = ptr[i];
1262 if ((c == '"') || (c == '\\') || (c < 0x20)) {
1263 TRY(write_dst(ptr, i));
1264 TRY(handle_unicode_code_point(c));
1265 ptr += i + 1;
1266 len -= i + 1;
1267 goto restart;
1268 }
1269 }
1270 TRY(write_dst(ptr, len));
1271 break;
1272 }
1273 return nullptr;
1274}
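
// Taken together, write_json_escaped_string and handle_unicode_code_point
// escape '"', '\\' and bytes below 0x20, and copy everything else through.
// A minimal sketch of the same rules, building a std::string instead of
// streaming to write_dst (the helper name is hypothetical):

```cpp
#include <cassert>
#include <string>

// json_escape_sketch is a hypothetical helper, not part of jsonptr. It
// returns the body of a JSON string (without the surrounding quotes).
std::string json_escape_sketch(const std::string& in) {
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : in) {
    switch (c) {
      case '"':
        out += "\\\"";
        break;
      case '\\':
        out += "\\\\";
        break;
      case '\b':
        out += "\\b";
        break;
      case '\f':
        out += "\\f";
        break;
      case '\n':
        out += "\\n";
        break;
      case '\r':
        out += "\\r";
        break;
      case '\t':
        out += "\\t";
        break;
      default:
        if (c < 0x20) {
          // Other control bytes use the six-byte "\u00XY" form.
          out += "\\u00";
          out += hex[c >> 4];
          out += hex[c & 0x0F];
        } else {
          out += static_cast<char>(c);
        }
        break;
    }
  }
  return out;
}
```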
1275
1276const char* //
Nigel Tao168f60a2020-07-14 13:19:33 +10001277handle_string(uint64_t vbd,
1278 uint64_t len,
1279 bool start_of_token_chain,
1280 bool continued) {
1281 if (start_of_token_chain) {
1282 if (g_flags.output_format == file_format::json) {
Nigel Tao3c8589b2020-07-19 21:49:00 +10001283 if (g_flags.output_cbor_metadata_as_json_comments &&
1284 !(vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8)) {
1285 TRY(write_dst("/*cbor:hex*/\"", 13));
1286 } else {
1287 TRY(write_dst("\"", 1));
1288 }
Nigel Tao168f60a2020-07-14 13:19:33 +10001289 } else {
1290 g_cbor_output_string_length = 0;
1291 g_cbor_output_string_is_multiple_chunks = false;
1292 g_cbor_output_string_is_utf_8 =
1293 vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8;
1294 }
1295 g_query.restart_fragment(in_dict_before_key() && g_query.is_at(g_depth));
1296 }
1297
1298 if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_0_DST_1_SRC_DROP) {
1299 // No-op.
1300 } else if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CONVERT_1_DST_1_SRC_COPY) {
1301 uint8_t* ptr = g_src.data.ptr + g_curr_token_end_src_index - len;
1302 if (g_flags.output_format == file_format::json) {
Nigel Taoaf757722020-07-18 17:27:11 +10001303 if (g_flags.input_format == file_format::json) {
1304 TRY(write_dst(ptr, len));
1305 } else if (vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8) {
Nigel Taod191a3f2020-07-19 22:14:54 +10001306 TRY(write_json_escaped_string(ptr, len));
Nigel Taoaf757722020-07-18 17:27:11 +10001307 } else {
1308 uint8_t as_hex[512];
1309 uint8_t* p = ptr;
1310 size_t n = len;
1311 while (n > 0) {
1312 wuffs_base__transform__output o = wuffs_base__base_16__encode2(
1313 wuffs_base__make_slice_u8(&as_hex[0], sizeof as_hex),
1314 wuffs_base__make_slice_u8(p, n), true,
1315 WUFFS_BASE__BASE_16__DEFAULT_OPTIONS);
1316 TRY(write_dst(&as_hex[0], o.num_dst));
1317 p += o.num_src;
1318 n -= o.num_src;
1319 if (!o.status.is_ok()) {
1320 return o.status.message();
1321 }
1322 }
1323 }
Nigel Tao168f60a2020-07-14 13:19:33 +10001324 } else {
1325 TRY(write_cbor_output_string(ptr, len, false));
1326 }
1327 g_query.incremental_match_slice(ptr, len);
Nigel Taob9ad34f2020-03-03 12:44:01 +11001328 } else {
Nigel Tao168f60a2020-07-14 13:19:33 +10001329 return "main: internal error: unexpected string-token conversion";
1330 }
1331
1332 if (continued) {
1333 return nullptr;
1334 }
1335
1336 if (g_flags.output_format == file_format::json) {
1337 TRY(write_dst("\"", 1));
1338 } else {
1339 TRY(write_cbor_output_string(nullptr, 0, true));
1340 }
1341 return nullptr;
1342}
1343
Nigel Taod191a3f2020-07-19 22:14:54 +10001344// ----
1345
const char*  //
handle_token(wuffs_base__token t, bool start_of_token_chain) {
  do {
    int64_t vbc = t.value_base_category();
    uint64_t vbd = t.value_base_detail();
    uint64_t len = t.length();

    // Handle ']' or '}'.
    if ((vbc == WUFFS_BASE__TOKEN__VBC__STRUCTURE) &&
        (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__POP)) {
      if (g_query.is_at(g_depth)) {
        return "main: no match for query";
      }
      if (g_depth <= 0) {
        return "main: internal error: inconsistent g_depth";
      }
      g_depth--;

      if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
        g_suppress_write_dst--;
        // '…' is U+2026 HORIZONTAL ELLIPSIS, which is 3 UTF-8 bytes.
        if (g_flags.output_format == file_format::json) {
          TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                            ? "\"[…]\""
                            : "\"{…}\"",
                        7));
        } else {
          // 0x65 is the CBOR header for a 5-byte text string.
          TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST)
                            ? "\x65[…]"
                            : "\x65{…}",
                        6));
        }
      } else if (g_flags.output_format == file_format::json) {
        // Write preceding whitespace.
        if ((g_ctx != context::in_list_after_bracket) &&
            (g_ctx != context::in_dict_after_brace) &&
            !g_flags.compact_output) {
          if (g_flags.output_json_extra_comma) {
            TRY(write_dst(",\n", 2));
          } else {
            TRY(write_dst("\n", 1));
          }
          for (uint32_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.spaces));
          }
        }

        TRY(write_dst(
            (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__FROM_LIST) ? "]" : "}",
            1));
      } else {
        // 0xFF is the CBOR "break" stop code, which ends an indefinite-length
        // container.
        TRY(write_dst("\xFF", 1));
      }

      g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                  ? context::in_list_after_value
                  : context::in_dict_after_key;
      goto after_value;
    }

    // Write preceding whitespace and punctuation, if it wasn't ']', '}' or a
    // continuation of a multi-token chain.
    if (start_of_token_chain) {
      if (g_flags.output_format != file_format::json) {
        // No-op.
      } else if (g_ctx == context::in_dict_after_key) {
        TRY(write_dst(": ", g_flags.compact_output ? 1 : 2));
      } else if (g_ctx != context::none) {
        if ((g_ctx == context::in_dict_after_brace) ||
            (g_ctx == context::in_dict_after_value)) {
          // Reject dict keys that aren't UTF-8 strings, which could otherwise
          // happen with -i=cbor -o=json.
          if ((vbc != WUFFS_BASE__TOKEN__VBC__STRING) ||
              !(vbd & WUFFS_BASE__TOKEN__VBD__STRING__CHAIN_MUST_BE_UTF_8)) {
            return "main: cannot convert CBOR non-text-string to JSON map key";
          }
        }
        if ((g_ctx == context::in_list_after_value) ||
            (g_ctx == context::in_dict_after_value)) {
          TRY(write_dst(",", 1));
        }
        if (!g_flags.compact_output) {
          TRY(write_dst("\n", 1));
          for (size_t i = 0; i < g_depth; i++) {
            TRY(write_dst(
                g_flags.tabs ? INDENT_TAB_STRING : INDENT_SPACES_STRING,
                g_flags.tabs ? 1 : g_flags.spaces));
          }
        }
      }

      bool query_matched_fragment = false;
      if (g_query.is_at(g_depth)) {
        switch (g_ctx) {
          case context::in_list_after_bracket:
          case context::in_list_after_value:
            query_matched_fragment = g_query.tick();
            break;
          case context::in_dict_after_key:
            query_matched_fragment = g_query.matched_fragment();
            break;
          default:
            break;
        }
      }
      if (!query_matched_fragment) {
        // No-op.
      } else if (!g_query.next_fragment()) {
        // There is no next fragment. We have matched the complete query, and
        // the upcoming JSON value is the result of that query.
        //
        // Un-suppress writing to stdout and reset g_ctx and g_depth as if we
        // were about to decode a top-level value. This makes any subsequent
        // indentation relative to this point, and we will return g_eod after
        // the upcoming JSON value is complete.
        if (g_suppress_write_dst != 1) {
          return "main: internal error: inconsistent g_suppress_write_dst";
        }
        g_suppress_write_dst = 0;
        g_ctx = context::none;
        g_depth = 0;
      } else if ((vbc != WUFFS_BASE__TOKEN__VBC__STRUCTURE) ||
                 !(vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__PUSH)) {
        // The query has moved on to the next fragment but the upcoming JSON
        // value is not a container.
        return "main: no match for query";
      }
    }

    // Handle the token itself: either a container ('[' or '{') or a simple
    // value: string (a chain of raw or escaped parts), literal or number.
    switch (vbc) {
      case WUFFS_BASE__TOKEN__VBC__STRUCTURE:
        if (g_query.matched_all() && (g_depth >= g_flags.max_output_depth)) {
          g_suppress_write_dst++;
        } else if (g_flags.output_format == file_format::json) {
          TRY(write_dst(
              (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST) ? "[" : "{",
              1));
        } else {
          // 0x9F and 0xBF start an indefinite-length CBOR array and map.
          TRY(write_dst((vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                            ? "\x9F"
                            : "\xBF",
                        1));
        }
        g_depth++;
        g_ctx = (vbd & WUFFS_BASE__TOKEN__VBD__STRUCTURE__TO_LIST)
                    ? context::in_list_after_bracket
                    : context::in_dict_after_brace;
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__STRING:
        TRY(handle_string(vbd, len, start_of_token_chain, t.continued()));
        if (t.continued()) {
          return nullptr;
        }
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__UNICODE_CODE_POINT:
        if (!t.continued()) {
          return "main: internal error: unexpected non-continued UCP token";
        }
        TRY(handle_unicode_code_point(vbd));
        g_query.incremental_match_code_point(vbd);
        return nullptr;

      case WUFFS_BASE__TOKEN__VBC__LITERAL:
        TRY(write_literal(vbd));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__NUMBER:
        TRY(write_number(vbd, g_src.data.ptr + g_curr_token_end_src_index - len,
                         len));
        goto after_value;

      case WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_SIGNED:
      case WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_UNSIGNED: {
        bool x_is_signed = vbc == WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_SIGNED;
        uint64_t x = x_is_signed
                         ? ((uint64_t)(t.value_base_detail__sign_extended()))
                         : vbd;
        if (t.continued()) {
          if (len != 0) {
            return "main: internal error: unexpected to-be-extended length";
          }
          g_token_extension.category = vbc;
          g_token_extension.detail = x;
          return nullptr;
        }
        TRY(write_inline_integer(
            x, x_is_signed, g_src.data.ptr + g_curr_token_end_src_index - len,
            len));
        goto after_value;
      }
    }

    int64_t ext = t.value_extension();
    if (ext >= 0) {
      switch (g_token_extension.category) {
        case WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_SIGNED:
        case WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_UNSIGNED:
          uint64_t x = (g_token_extension.detail
                        << WUFFS_BASE__TOKEN__VALUE_EXTENSION__NUM_BITS) |
                       ((uint64_t)ext);
          TRY(write_inline_integer(
              x,
              g_token_extension.category ==
                  WUFFS_BASE__TOKEN__VBC__INLINE_INTEGER_SIGNED,
              g_src.data.ptr + g_curr_token_end_src_index - len, len));
          g_token_extension.category = 0;
          g_token_extension.detail = 0;
          goto after_value;
      }
    }

    if (t.value_major() == WUFFS_CBOR__TOKEN_VALUE_MAJOR) {
      uint64_t value_minor = t.value_minor();
      if (value_minor & WUFFS_CBOR__TOKEN_VALUE_MINOR__TAG) {
        // TODO: CBOR tags.
      } else if (value_minor & WUFFS_CBOR__TOKEN_VALUE_MINOR__MINUS_1_MINUS_X) {
        TRY(write_cbor_minus_1_minus_x(
            g_src.data.ptr + g_curr_token_end_src_index - len, len));
        goto after_value;
      }
    }

    // Return an error if we didn't match the (value_major, value_minor) or
    // (vbc, vbd) pair.
    return "main: internal error: unexpected token";
  } while (0);

  // Book-keeping after completing a value (whether a container value or a
  // simple value). Empty parent containers are no longer empty. If the parent
  // container is a "{...}" object, toggle between keys and values.
after_value:
  if (g_depth == 0) {
    return g_eod;
  }
  switch (g_ctx) {
    case context::in_list_after_bracket:
      g_ctx = context::in_list_after_value;
      break;
    case context::in_dict_after_brace:
      g_ctx = context::in_dict_after_key;
      break;
    case context::in_dict_after_key:
      g_ctx = context::in_dict_after_value;
      break;
    case context::in_dict_after_value:
      g_ctx = context::in_dict_after_key;
      break;
    default:
      break;
  }
  return nullptr;
}

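/*
The after_value book-keeping in handle_token amounts to a small state machine
over the context enum. The following is a hedged, standalone sketch of just
that transition rule, not part of jsonptr itself. The "sketch" namespace and
the next_context function are hypothetical names, and the enum is re-declared
here (with an assumed enumerator order) only so that the fragment compiles on
its own.

```cpp
#include <cassert>

namespace sketch {

// Re-declaration for illustration only. The enumerator names follow the ones
// used by handle_token; their order here is an assumption.
enum class context {
  none,
  in_list_after_bracket,
  in_list_after_value,
  in_dict_after_brace,
  in_dict_after_key,
  in_dict_after_value,
};

// Returns the context that follows a completed value. Empty containers become
// non-empty, and dicts alternate between "after key" and "after value".
inline context next_context(context c) {
  switch (c) {
    case context::in_list_after_bracket:
      return context::in_list_after_value;
    case context::in_dict_after_brace:
      return context::in_dict_after_key;
    case context::in_dict_after_key:
      return context::in_dict_after_value;
    case context::in_dict_after_value:
      return context::in_dict_after_key;
    default:
      // context::none and in_list_after_value are left unchanged.
      return c;
  }
}

}  // namespace sketch
```

Under this sketch, the first value completed after a '{' is treated as a key,
the next one as that key's value, and so on in alternation.
*/
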
const char*  //
main1(int argc, char** argv) {
  TRY(initialize_globals(argc, argv));

  bool start_of_token_chain = true;
  while (true) {
    wuffs_base__status status = g_dec->decode_tokens(
        &g_tok, &g_src,
        wuffs_base__make_slice_u8(g_work_buffer_array, WORK_BUFFER_ARRAY_SIZE));

    while (g_tok.meta.ri < g_tok.meta.wi) {
      wuffs_base__token t = g_tok.data.ptr[g_tok.meta.ri++];
      uint64_t n = t.length();
      if ((g_src.meta.ri - g_curr_token_end_src_index) < n) {
        return "main: internal error: inconsistent g_src indexes";
      }
      g_curr_token_end_src_index += n;

      // Skip filler tokens (e.g. whitespace).
      if (t.value_base_category() == WUFFS_BASE__TOKEN__VBC__FILLER) {
        start_of_token_chain = !t.continued();
        continue;
      }

      const char* z = handle_token(t, start_of_token_chain);
      start_of_token_chain = !t.continued();
      if (z == nullptr) {
        continue;
      } else if (z == g_eod) {
        goto end_of_data;
      }
      return z;
    }

    if (status.repr == nullptr) {
      return "main: internal error: unexpected end of token stream";
    } else if (status.repr == wuffs_base__suspension__short_read) {
      if (g_curr_token_end_src_index != g_src.meta.ri) {
        return "main: internal error: inconsistent g_src indexes";
      }
      TRY(read_src());
      g_curr_token_end_src_index = g_src.meta.ri;
    } else if (status.repr == wuffs_base__suspension__short_write) {
      g_tok.compact();
    } else {
      return status.message();
    }
  }
end_of_data:

  // With a non-empty g_query, don't try to consume trailing whitespace or
  // confirm that we've processed all the tokens.
  if (g_flags.query_c_string && *g_flags.query_c_string) {
    return nullptr;
  }

  // Check that we've exhausted the input.
  if ((g_src.meta.ri == g_src.meta.wi) && !g_src.meta.closed) {
    TRY(read_src());
  }
  if ((g_src.meta.ri < g_src.meta.wi) || !g_src.meta.closed) {
    return "main: valid JSON|CBOR followed by further (unexpected) data";
  }

  // Check that we've used all of the decoded tokens, other than trailing
  // filler tokens. For example, "true\n" is valid JSON (and fully consumed
  // with WUFFS_JSON__QUIRK_ALLOW_TRAILING_NEW_LINE enabled) with a trailing
  // filler token for the "\n".
  for (; g_tok.meta.ri < g_tok.meta.wi; g_tok.meta.ri++) {
    if (g_tok.data.ptr[g_tok.meta.ri].value_base_category() !=
        WUFFS_BASE__TOKEN__VBC__FILLER) {
      return "main: internal error: decoded OK but unprocessed tokens remain";
    }
  }

  return nullptr;
}

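/*
In main1's inner loop, tokens carry only a length, not a position, so the
source bytes for a token are recovered as the range that ends at a running
"end" index (g_curr_token_end_src_index) and starts len bytes earlier. The
following is a hedged, standalone sketch of that offset book-keeping, not part
of jsonptr itself. The "sketch" namespace and the split_by_lengths function
are hypothetical, and std::string and std::vector are used purely for
illustration (the real program does not dynamically allocate memory).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Splits src into consecutive pieces whose sizes are given by lengths, which
// stand in for successive token lengths. Stops early if the lengths would run
// past the end of src, mirroring the "inconsistent g_src indexes" check.
inline std::vector<std::string> split_by_lengths(
    const std::string& src,
    const std::vector<size_t>& lengths) {
  std::vector<std::string> pieces;
  size_t end = 0;  // Plays the role of g_curr_token_end_src_index.
  for (size_t n : lengths) {
    if ((src.size() - end) < n) {
      break;
    }
    end += n;
    // The token's bytes are the n bytes that finish at end.
    pieces.push_back(src.substr(end - n, n));
  }
  return pieces;
}

}  // namespace sketch
```

For instance, splitting "[1, 2]" with six lengths of 1 yields one piece per
byte, and the piece for the space is what a filler token would cover.
*/
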
int  //
compute_exit_code(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  size_t n;
  if (status_msg == g_usage) {
    n = strlen(status_msg);
  } else {
    n = strnlen(status_msg, 2047);
    if (n >= 2047) {
      status_msg = "main: internal error: error message is too long";
      n = strnlen(status_msg, 2047);
    }
  }
  const int stderr_fd = 2;
  ignore_return_value(write(stderr_fd, status_msg, n));
  ignore_return_value(write(stderr_fd, "\n", 1));
  // Return an exit code of 1 for regular (foreseen) errors, e.g. badly
  // formatted or unsupported input.
  //
  // Return an exit code of 2 for internal (exceptional) errors, e.g. defensive
  // run-time checks found that an internal invariant did not hold.
  //
  // Automated testing, including badly formatted inputs, can therefore
  // discriminate between expected failure (exit code 1) and unexpected failure
  // (other non-zero exit codes). Specifically, exit code 2 for internal
  // invariant violation, exit code 139 (which is 128 + SIGSEGV on x86_64
  // Linux) for a segmentation fault (e.g. null pointer dereference).
  return strstr(status_msg, "internal error:") ? 2 : 1;
}

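/*
The exit code rule in compute_exit_code depends only on whether the status
message is null and whether it contains the substring "internal error:". The
following is a hedged, standalone sketch of that classification without the
stderr side effects or the message length cap, not part of jsonptr itself. The
"sketch" namespace and the classify_status function are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

// Maps a status message to an exit code: 0 for success (a null message), 2
// for messages that mention "internal error:" and 1 for all other errors.
inline int classify_status(const char* status_msg) {
  if (!status_msg) {
    return 0;
  }
  return std::strstr(status_msg, "internal error:") ? 2 : 1;
}

}  // namespace sketch
```

A test harness could use the same three-way split: treat 0 and 1 as expected
outcomes for well-formed and badly formed inputs, and treat any other exit
code as a bug to investigate.
*/
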
int  //
main(int argc, char** argv) {
  // Look for an input filename (the first non-flag argument) in argv. If there
  // is one, open it (but do not read from it) before we self-impose a sandbox.
  //
  // Flags start with "-", unless they come after a bare "--" arg.
  {
    bool dash_dash = false;
    int a;
    for (a = 1; a < argc; a++) {
      char* arg = argv[a];
      if ((arg[0] == '-') && !dash_dash) {
        dash_dash = (arg[1] == '-') && (arg[2] == '\x00');
        continue;
      }
      g_input_file_descriptor = open(arg, O_RDONLY);
      if (g_input_file_descriptor < 0) {
        fprintf(stderr, "%s: %s\n", arg, strerror(errno));
        return 1;
      }
      break;
    }
  }

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  g_sandboxed = true;
#endif

  const char* z = main1(argc, argv);
  if (g_wrote_to_dst) {
    const char* z1 = (g_flags.output_format == file_format::json)
                         ? write_dst("\n", 1)
                         : nullptr;
    const char* z2 = flush_dst();
    z = z ? z : (z1 ? z1 : z2);
  }
  int exit_code = compute_exit_code(z);

#if defined(WUFFS_EXAMPLE_USE_SECCOMP)
  // Call SYS_exit explicitly, instead of calling SYS_exit_group implicitly by
  // either calling _exit or returning from main. SECCOMP_MODE_STRICT allows
  // only SYS_exit.
  syscall(SYS_exit, exit_code);
#endif
  return exit_code;
}