Introduction
Have you ever found yourself stuck with having to write a *nix / *nux shell
script that cannot assume that Bash and system core utilites having GNU
extensions are available and so specify the shell processor simply to be
/bin/sh
and then found you could really use some sort of container to store
multiple values?
This happened to me a while ago and I thought I would attempt to create an associative container abstraction that could be used with only basic shell facilities.
Basic selection
Having decided to use "only basic shell facilities", and not being a shell
scripting guru, the next question was just what are 'basic shell facilities'?
What facilities are available for sh
? In fact what shell is used when we
use /bin/sh
? It seems that these days /bin/sh
may simply link to some
other shell program such as Dash.
A little research lead me to the Open Group's [OpenGroup] page on the POSIX (IEEE Std 1003.1-2017) Shell Command Language [ShellLang]. It seemed that trying to use only the shell language facilities as described on this page would be the most portable solution, assuming as few facilities would be available as possible.
Not quite basic
After reading through the specification of the Shell Command Language it
became apparent that one feature that I would really like to use was not
available as standard. The local
utility used to declare variables local to
a function call is not required to be implemented but is sort of reserved in
that the results of using local
has unspecified results. It seems local
variables within a function were considered but rejected, though the
identifier local was reserved just in case local function variables
make it into a future version of the POSIX standard [ShellLangRationale].
I decided that I would use local
- at least until it proved a problem,
whereby most uses could probably be replaced with careful use of global
variables. The exception would be cases where recursion was required.
However, I would assume local variables were only in scope in the call
of the declaring function, and not in functions called from the declaring
function call - similar to most other languages' local variable scoping rules.
The requirements
In the end it turned out that in addition to local
some basic use of core
utilities were required. The full requirements to use the associative
container shell script library are:
-
the POSIX Shell Command Language as described at [ShellLang].
-
the
local
utility. -
echo -n
- the use of-n
is not strictly portable. -
basic use of the
tr
character transform utility. -
basic use of the
sed
stream editor utility to perform global subsitutions.
What’s in the box?
Having decided on a super-set of the standard POSIX Shell Command Language to use the next thing was to look at the available facilities and work out how they could be used. Some of the facilities (or lack of) that are of note are mentioned below.
First off flow control contructs - if
, elif
, else
, case
, while
,
until
, for
are supported, as are user-defined functions. However,
returning values from a direct function call (one that does not start a
sub-command shell in a separate process) is limited to the numeric exit
status - although the boolean like true
and false
can be used for
predicate functions that can be called and tested in flow control constructs
such as if
or while
. This is because true
and false
are built-in
utilities that return with an exit code of 0 (success) or a non-zero value
(failure) respectively.
To return a copy of an expression, as is more familiar, a user-defined function can be called using Command Substitution which executes the function in a separate, presumably forked, sub-shell process. The output to stdout from a Command Substitution call are captured and effectively become the function call’s return value. Note that Command Substitution is also commonly used to catpure the output of commands into variables. However creating and destroying the sub-shell around the function call is expensive.
Variables store string values. Sometimes these can be interpreted as numbers. There are some limited operations that can be performed on a variable’s string value:
-
Various error or default/alternative value substitution actions around values being null, set and not null or unset.
-
Character length of the string value of a variable.
-
Removal from a variable’s value of a pattern matching the smallest or largest prefix or suffix of the variable’s value.
Notably, as far as I could see, there is no facility to more directly reference variable values' characters or substrings other than the prefix/suffix pattern match removal operations.
Arithmetic can be performed on numeric values using Arithmetic Expansion.
Facilities often used on the command line such as redirection (>
, <
, etc.),
pipelines (|
), and-or lists (&&
, ||
) , asynchronous execution (&
) and
so on are supported.
There are a set of special variables. Of particularly interest are $@
and
$#
for the list of parameters passed to the script from the command line or
to a function call and the number of such parameters, along with the set
,
unset
and shift
built-in utilities that allow control and processing of
script and function call parameter lists.
Script source files can be included in other script source files using the
dot (.
) built-in utility. The bash style source
synonym for dot is not
supported.
The shape of code to come
By now I had an idea of the general shape of the project. The first order of business would be to decide on a name and use it, or a form of it, as the basis for file names and, given the lack of namespaces, as a prefix to function names and the like.
The implementation code would be contained in a shell script file intended for
inclusion using .
, the dot built-in utility, in client scripts. Such
included files do not need to be executable themselves. Sometime later I
found out that there is a convention that such included shell script library
files should have a .sh suffix.
The public, client, API would consist of a set of functions, most of which would return string values. To make the use of these functions more familair they would be designed to be called using Command Substitution.
Any private internal implementation details - helper functions, global variables and the like - would have uglified names involving prefixed and suffixed double underscores.
Given that values are strings, the associative container instances would have to be represented as a string. Instance creation operations would have to return the string-instance value. Operations that do not modify an instance would take an instance as the first parameter to its implementation function. Operations that modify an instance would both take an intance as the first parameter and return the modified instance. Note that all parameter passing and value returning is effectively by value, so lots of potential copying.
What’s in a name
The first point of action was to decide a name from which the script library file name and library script identifier names can be derived.
The name I chose, following in the footsteps of Python, was dict.
The library file name was also originally just dict but later was changed to dict.sh, following the discovery of the previously mentioned convention.
What’s a dict to do?
Obviously dict instances would be of a dictionary type - also known as maps. These are associative containers whose entries are key:value value pairs that allow values to be looked up by their associated key. They can be ordered or unordered.
Given the limited options available with the facilities of POSIX Shell Command Language I decided it would be much simpler to opt for dicts being unordered.
The basic operations required would be:
-
Create dict instances.
-
Add and update entries in a dict.
-
Lookup and return a value in a dict given its associated key.
-
Remove an entry from a dict given the associated key.
Some additional operations also proved useful:
-
A function to return the size of the dict, being the number of entries.
-
A predicate to check whether some value, which of course is a string, appears to be a string formatted as a dict.
-
A for-each operation that calls a function for each entry in a dict.
-
A customisable pretty-print function to pretty print a dict in a user-controlled fashion.
-
To aid debugging a function to print a dict string with special unprintable characters translated to printable characters.
One capability that needed to be added explicitly is the ability to nest one dict as a value of another, preferrably to an arbitrary depth.
The dict data format
As the only data type to play with is the string and with no true random access operations into a string the format of a dict was going to have to be sequential, with an appearence of random access only - think tape access rather than disk. This sort of thinking lead to vague memories of some of the lesser used control codes in the ASCII (and thereby the Unicode®) character set. The ones of interest here are the separator control codes. Checking a convenient ASCII table source [AsciiTable] we see these are:
-
FS (value hex 1C) - File Separator
-
GS (value hex 1D) - Group Separator
-
RS (value hex 1E) - Record Separator
-
US (value hex 1F) - Unit Separator
On the one hand using these values to separate parts of a dict instance means keys and values cannot contain them. On the other hand for the intended use cases this should not be much of a problem. Where it would be an issue would be keys or values being binary data but this is not a primary use case. If storing binary data in a dict is required then the data can be encoded, for example with base64 encoding.
The current layout of a dict in Extended Backus Naur Form [EBNF] is as follows:
dict = header, { entry };
header = dict-type, entry-separator, version, field-separator,
size, entry-separator;
entry = key, field-separator, value, entry-separator;
dict-type = GS, "D", "i" "C", "t", GS;
version = digits, ".", digits, ".", digits;
size = digits;
key = string;
value = string | prepared-dict;
entry-separator = field-separator, record-separator;
field-separator = US;
record-separator = RS;
prepared-dict = ? a dict instance that has had its special separator
characters substituted so it can be safely used as
a nested dict value
?;
string = character, { character };
character = ? any valid character except ASCII FS, GS, RS, US ?;
digits = digit, { digit };
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
GS = "\0x1D";
RS = "\0x1E";
US = "\0x1F";
There are several things to note about the layout. The first is that a dict instance has two main parts: the header followed by zero or more entries.
The header consists of an initial dict identifier 'chunk' consisting of the four characters 'DiCt' surrounded by an ASCII Group Separator character on each side. This is followed by metadata fields for the version of the dict structure (currently '1.1.0') and the size of the dict being the number of entries, which starts at zero (an empty dict) and is updated as entries are added and removed. Note that each field is followed by an ASCII Unit Separator character, both in the header metadata record and within entries. The end of the metadata record is signified by an ASCII Record Separator character.
The following entries section consists of key:value pair entry records. As with the metadata record each key and each value is followed by an ASCII Unit Separator character, and each entry record is terminated with an ASCII Record Separator character.
Keys can be any string of characters except those used as separators - currently the ASCII GS, RS and US characters are used, but the FS - File Separator character might have a use in a later update, so I suggest it is best to not use any of the four ASCII separator characters.
The same restrictions apply to entry values, except that a dict can be nested as a value of another dict. In these cases the characters that could cause confusion are substituted for different sequences on entry with the reverse substitutions being applied on value extraction.
The dict API functions
The current set of dict API functions are:
-
dict_declare
to create a dict and optionally populate it with one or more initial entries. -
dict_set
to add or update one or more entries to a dict. -
dict_get
to return from a dict the value associated with the provided key, or an empty string if not found. -
dict_remove
to remove an entry from a dict given its key. -
dict_size
to return the number of entries as stored in a dict's metadata record’s size field. -
dict_is_dict
- a predicate function - to check if a value is a dict. -
dict_for_each
to iterate over the entries of a dict calling a function for each entry passing it the entry’s key, value, one based computed record index and any extra parameters passsed todict_for_each
. -
dict_count
returns the count of the number of records which should be the same as the value returned bydict_size
. It usesdict_for_each
and its computed record index to iteratively count the number of entries. The main use ofdict_count
is for testing. -
dict_pretty_print
prints the entries of a dict according to a caller provided format - itself specified by a dict. Usesdict_for_each
to recursively iterate over entries and nested dict entry values. This is a case where local variables are really required. -
dict_print_raw
to print the raw dict string with the unprintable separator characters translated to others. Mainly useful for debugging.
In addition there are simple versions of functions that insert or extract
values: dict_declare_simple
, dict_set_simple
and dict_get_simple
.
These function act as their non-simple versions except that passing dicts
as values is not supported. They assume entry values inserted or extracted are
not dict strings and thus elide the checks and special handling required
for dict values.
Finally there is a function that is intended to be used with dict_for_each
,
dict_op_to_var_flat
, that creates a variable for each entry in the dict
based on the entry’s key name. The 'flat' part of the name is intended to
indicate that the function does not recurse down into nested dict values.
Slice and splice
There are some basic private internal constants and operations to aid in building and updating dict strings. For example the initial empty dict is a fixed string as the only variable part is the size metadata field value and initially this is always 0.
Some small functions help out with chores such as ensuring keys are followed by the US field separator character and building an entry string by combining such a decorated key with a value and the US RS entry separator character sequence.
Locating entries is more interesting. The value associated with a key is
required as the result. First the header is stripped from the dict (remember
parameters are passed by value so in fact the header is stripped from a
copy of the dict) by using the remove-smallest-prefix operation
${dict#hdr-pattern}
, where hdr-pattern
is the fixed part of the header
including the entry-separator
with the size
field value wildcarded.
Next the rest of the dict string up to and including the required key field
is removed using the remove-smallest-prefix operation again, this time with
a pattern specifying all characters up to the preceding entry-separator
and
the decorated key value. This leaves the start of the dict substring copy
with the value associated with the key followed by all the subsequent entries.
Finally all these subsequent entries that follow the value are removed using
the remove-largest-suffix operation ${value_plus%%pattern}
with a pattern
specifying the entry-separator
and all following characters, leaving just
the value.
If removing the dict portion up to the required value did not remove anything then there was no pattern match meaning the key was not found in the dict. In this case an empty string is returned.
Add or update is similar except in this case if a key is found the prefix and suffix parts of the sliced up dict around the associated value are kept and the new value spliced in between them, the old value being discarded. In the case a key is not found a new entry is created and appended to the end of the dict.
Entry removal again consists of breaking the existing dict into parts and splicing parts back together. In this case the parts spliced together to form the result are a prefix part up to the key of the entry being removed and a suffix part from the start of the entry, if any, following the removed entry’s entry separator sequence.
A wrinkle in the above is the updating of the size metadata field. This is performed if required after the dict (copy) has been spliced back together and is done by stripping the header and then appending the resulting entry records to a newly built header with the updated size value using string concatenation of the various substring parts.
dict_for_each
first strips off the header of its passed dict copy then
repeatedly extracts the key and value of the initial entry record in the
resulting entry records and then removes the whole of the initial entry
record, until eventually there are no more entries to remove. On each
iteration a function whose name is passed as the second parameter to
dict_for_each
is called and passed the extracted key and value values and
a one-based computed record number along with any parameters passed to
dict_for_each
following the second function name parameter.
Being prepared
If the value of an entry to add or update is itself a dict then it has to be
modified before being inserted otherwise its entries can interfere with
operations on the outer containing dict. Likewise, if the value of an entry
located with dict_get
is a modified nested dict then it has to be modified
back to its original state before the value is returned.
dict_declare
and dict_set
check to see if a value to add or update is
a dict and if it is it is prepared for nesting. Likewise, if a value
extracted by dict_get
appears to be a prepared nested dict then it is
prepared for unnesting.
The prepare-for-nesting operation inserts an ASCII GS character before every
field-separator
and record-separator
- which are the ASCII US and RS
characters respectively. This prevents the matching that occurs with the
slicing and splicing of dict operations from matching any fields or entries
in the prepared for nesting dict value. For example when stripping away
entries before the value of an associated key the pattern used to match the
smallest prefix to remove is everything (*
) up to an entry-separator
followed by the decorated key, being the key followed by a field-separator
,
which for a key value of "KEY" expands to:
match = [string], US, RS, "K", "E", "Y", US;
If a nested dict value happened to also have an entry with a key value of "KEY", and this occured before the entry being searched for, if the nested dict entry had not been prepared this entry would be found. However after preparing for nesting the entry fragments that would have matched would have the following sequence:
prepared-averted-match = GS, US, GS, RS, "K", "E", "Y", GS, US;
Which fails to match both as the prepared modified entry-separator
sequence
is now GS, US, GS, RS
and as if that were not enough the modified
decorated key has become "K", "E", "Y", GS, US
.
The prepare-for-unnesting reverses the effects of preparing for nesting. It
replaces all occurrences of GS,US
with US
and every occurrence of
GS,RS
with RS
.
These transformations work for multiple levels of nested dict values. Each
additional level of nesting causes an additonal GS
character to be inserted
before each US
and RS
character. So if a dict has the following sequence
of characters for an entry:
unnested-entry = "K", "E", "Y", US, "V", "A", "L", "U", "E", US, RS;
Then inserting the dict into another dict tranforms the entry character sequence like so:
nested-entry-level-1 = "K", "E", "Y", GS, US, "V", "A", "L", "U", "E",
GS, US, GS, RS;
If the second dict is then inserted into a third, the already prepared-for-nesting nested dict value is modified again when the second dict is prepared for nesting, and the entry sequence is further transformed thus:
nested-entry-level-2 = "K", "E", "Y", GS, GS, US, "V", "A", "L", "U", "E",
GS, GS, US, GS, GS, RS;
If the second dict nested in the third dict is looked up in the third
dict then its prepared separator sequences are transformed by the
prepare-for-unnesting operation effectively removing one GS
character before
each US
and RS
character, which for the initial dict, nested as a value
of the second nested dict, removes one of the two accumulated GS
characters before its US
and RS
characters, transforming the entry
sequence from nested-entry-level-2
back to nested-entry-level-1
. If the
dict nested in the second dict copy obtained from the third dict is then
looked up in the second dict copy its prepared separator sequences will
again be transformed by the prepare-for-unnesting operation, removing the
remaining GS
characters before its US
and RS
characters, transforming
the entry sequence from nested-entry-level-1
back to unnested-entry
.
Unforunately, the only way I could see to perform the required replacements
for the prepare-for-nesting and prepare-for-unnesting operations was to farm
the tasks out to sed
using two global substitutions per preparation
operation. This of course is quite a heavy weight afair.
Exemplary dictation
Enough of the waffle, let’s dive in an see how to use dict in practice. The dict.sh shell script library and the examples shown are available in the GitHub sh-utils repository [ShUtils].
So let us start with the obligatory Hello World example.
#!/bin/sh
. dict.sh
record="$(dict_declare_simple 'greeting' 'Hello'\
'who' 'World')"
echo "$(dict_get_simple "${record}" 'greeting'),\
$(dict_get_simple "${record}" 'who')!"
record="$(dict_set_simple "${record}"\
'greeting' 'Hi' 'who' 'Earth')"
greeting="$(dict_get_simple "${record}"\
'greeting')"
who="$(dict_get_simple "${record}" 'who')"
echo "${greeting}, ${who}!"
First the source of the dict shell script library is included using the dot
built-in. It is assumed that dict.sh is either on the PATH
or in the same
directory as the Hello World script.
Next a dict is created using dict_declare_simple
, as we are not nesting
any dict values. It can be seen that the parameters to dict_declare_simple
are values for initial entries to initialise the new dict with. They are
grouped pairwise: key1 value1 key2 value2 … keyN valueN. This is easy to
handle in Shell Command Language thanks to the $@
and $#
special
parameters and the shift
built-in. The returned dict string is assigned to
a variable named record
.
The whole greeting is then written to the console by calling dict_get_simple
to obtain the values associated with the greeting and who keys. The
returned values are expanded directly into the echo
'd string.
The dict is then updated by calling dict_set_simple
. Note the pattern of
passing the expanded record
variable as the first argument and receiving the
returned dict string back into the record
variable. If we wanted to keep
the original state of record
then the updated dict could be assigned to a
different variable. Once again note the pairwise grouping of entry keys and
values in the parameter list.
Finally the greeting and who values are obtained from the updated dict, this time assigning each value to its own variable, which are then written to the console.
Assuming the Hello World script file is called dict_hello_world
, we are in
a console session and the working directory is the directory containing
dict_hello_world
then executing ./dict_hello_world
should produce the
following results (not forgetting to ensure the script is executable):
Hello, World! Hi, Earth!
What is needed though is more colour! This can be achieved by using ANSI terminal control sequences [AnsiTermCodes]. In the next example the Hello World example is extended to add keys foreground and background which have nested dict values containing keys r, g and b to store 24-bit (8 bit + 8 bit + 8 bit) RGB colour value triples which are used to set the foreground and background colours of the output greeting text using 24-bit colour mode ANSI terminal control escape codes.
#!/bin/sh
. dict.sh
readonly ASCII_ESC=$(echo '\033')
readonly ANSI_CMD_PREFIX="${ASCII_ESC}["
readonly ANSI_CMD_END="2m"
readonly ANSI_CMD_RESET="${ANSI_CMD_PREFIX}0m"
readonly \
ANSI_CMD_FG_MULTIBIT="${ANSI_CMD_PREFIX}38"
readonly \
ANSI_CMD_BG_MULTIBIT="${ANSI_CMD_PREFIX}48"
readonly \
ANSI_CMD_FG_24BIT="${ANSI_CMD_FG_MULTIBIT};2"
readonly \
ANSI_CMD_BG_24BIT="${ANSI_CMD_BG_MULTIBIT};2"
set_ansi_24bit_colours_string() {
local fg="${1}"
local bg="${2}"
local r="$(dict_get_simple "${fg}" 'r')"
local g="$(dict_get_simple "${fg}" 'g')"
local b="$(dict_get_simple "${fg}" 'b')"
local fg_cmd="${ANSI_CMD_FG_24BIT};${r};${g};\
${b}${ANSI_CMD_END}"
r="$(dict_get_simple "${bg}" 'r')"
g="$(dict_get_simple "${bg}" 'g')"
b="$(dict_get_simple "${bg}" 'b')"
return_value="${fg_cmd}${ANSI_CMD_BG_24BIT};\
${r};${g};${b}${ANSI_CMD_END}"
}
record="$(dict_declare \
'greeting' 'Hello' 'who' 'World' \
'foreground' "$(dict_declare_simple 'r' '127' \
'g' '255' 'b' '80' )" \
'background' "$(dict_declare_simple 'r' '80' \
'g' '0' 'b' '0' )" \
)"
set_ansi_24bit_colours_string \
"$(dict_get "${record}" 'foreground')" \
"$(dict_get "${record}" 'background')"
echo -n "${return_value}"
echo -n "$(dict_get_simple "${record}"\
'greeting'),\
$(dict_get_simple "${record}" 'who')!"
echo "${ANSI_CMD_RESET}"
record="$(dict_set_simple "${record}" \
'greeting' 'Hi' 'who' 'Earth')"
record="$(dict_set "${record}" \
'foreground' "$(dict_declare_simple 'r' '255' \
'g' '127' 'b' '80' )" \
'background' "$(dict_declare_simple 'r' '0' \
'g' '80' 'b' '0')" \
)"
fore="$(dict_get "${record}" 'foreground')"
back="$(dict_get "${record}" 'background')"
set_ansi_24bit_colours_string "${fore}" "${back}"
greeting="$(dict_get_simple "${record}" \
'greeting')"
who="$(dict_get_simple "${record}" 'who')"
echo "${return_value}${greeting},\
${who}!${ANSI_CMD_RESET}"
After including the dict.sh source as before there is support for setting
the ANSI terminal foreground and background 24-bit colours consisting of a
bunch of constants (readonly
variables) that build up the ANSI terminal
control escape codes and the set_ansi_24bit_colours_string
function that
takes dicts having r, g and b keys whose values specify the individual
8-bit components of the 24-bit colours for the foreground and background.
The set_ansi_24bit_colours_string
result is returned in the return_value
global variable and is a string that will set the requested 24-bit colours
when written to a supporting ANSI terminal such Xterm, GNOME terminal or KDE’s
Konsole (as listed in the 24-bit section of [AnsiTermCodes]).
As previously a dict is created. However this time dict_declare
must be
used as some of the initial entry values are dicts. The nested dict
values for the foreground and background entries are created by calling
dict_declare_simple
as none of the nested dicts' entries' values are
themselves dicts.
set_ansi_24bit_colours_string
is called before outputting the complete
greeting, and is passed the foreground and background RGB triple dicts
which were obtained using dict_get
(as they are nested dict values) from
the record
dict and stored in variables fore
and back
.
The string returned by set_ansi_24bit_colours_string
in return_value
is
echo
'd to the console without a terminating newline to set the terminal’s
foreground and background colours and then the greeting is output as before
except no terminating newline is output as following the greeting the
ANSI_CMD_RESET
constant string is output to reset the terminal, removing the
previously set foreground and background colours.
The greeting and who entries of record
are updated as before with
dict_set_simple
. Then the foreground and background RGB triple
dict values are updated using dict_set
. Note that all four entries could
have been updated in a single call to dict_get
. However, doing it in two
parts demonstrates that a dict containing nested dict values can still be
operated on by the simple forms of the API functions so long as no dict
entry values are involved in that specific function call.
Once more the (updated) foreground and background RGB triple dicts are
passed to set_ansi_24bit_colours_string
to obtain the new set-colours ANSI
control escape code string. The updated values for the greeting and who
entries are, as before, obtained with calls to dict_get_simple
and stored in
variables greeting
and who
.
Finally, the whole output sequence: set-colour-control-escape-codes,
greeting
, who
, ANSI reset control-escape-code is output in a single
echo
command.
Executing the extended Hello World example should produce the same textual output, only with more colour involved.
Other tricks
There are a few things that you can do with dicts that might be of interest. First dicts are strings. Yes I know that has been stated a time or two already. However, as strings they can be used like any other string, although of course just printing them out looses something in the process - the unprintable separator characters for a start.
More usefully they can be passed to commands and other scripts as arguments. This is useful if, for example, you have a suite of mutli-stage scripts that run one after the other and it is the intial script’s job to obtain the arguments and options for the job - they are collected in a dict which is passed from one script to the next, each accessing the arguments and options relevant to itself.
Of course as strings they can also be saved to disk and read back later, or sent over a network. Basically dicts are already serialised. The only issue in the future would be with regard to possible differing versions of the dict data format, at which time the version metadata field in the dict header would become a lot more relevant.
As an associative key:value container it is possible to use dicts to emulate other container types. Two that come to mind are vectors (that is single dimension arrays) and sets.
The characteristics of a vector are that values are identified by an integer index, usually starting at 0 or 1 and increasing by one for each entry. Another characteristic of some implementations is that they can only have elements efficiently added to the end of the vector.
A dict can emulate a vector which for example has elements starting at index 0 and only allows appending elements to the end. The idea would be to wrap the underlying dict operation function calls in vector specific operation functions.
Operations that create new elements do not require the index key values as
parameters. Rather they synthesise the next index as being the value returned
by dict_size
. Hence an empty vector-dict would have size zero and so the
index key for the first element would be '0', the size is then one thus the
index of the next added element would be '1' and so on. Operations that add
new elements would be created with initial element values and append new
element values, which would be wrappers around dict_declare
(or
dict_declare_simple
) and dict_set
(or dict_set_simple
). A
vector_append
function might look as below:
vector_append() {
local vector="${1}"
shift
local count=$#
local index=$(dict_size "${vector}")
while [ ${count} -gt 0 ]; do
local value="${1}"
shift
set -- "$@" "${index}" "${value}"
count=$(( ${count}-1 ))
index=$(( ${index}+1 ))
done
# pass the vector and converted entry values
# to dict_set:
vector_return_value="$(dict_set "${vector}" "$@")"
}
Note that the values to be added have to be pairwise interleaved with their
index key values which requires modifying the $@
special parameter while
iterating over it which is why a snapshot of the value of $#
is taken before
the iteration starts which is then manually decremented. Similarly a snapshot
of the intitial size of the vector is taken before iteration start which is
incremented manually - this is to reduce the overhead of calling dict_size
.
Other operations that could be useful would be accessing a specific element
by index key, which would simply devolve to a call to dict_get
(or
dict_get_simple
), and updating an existing element, which would basically
wrap a call to dict_set
(or dict_set_simple
) with checks on the index
value to ensure it is a valid and in range index value.
Iteration can of course be performed via dict_for_each
, possibly via a
wrapper that adapts the function called if not all of the usual three
arguments dict_for_each
passes to the supplied function for each entry (key,
value and computed one based record number) are not required.
Emulating a set is similar in that both keys and values do not need to be
specified when adding set members. In this case it is only the key values
that need to be given, the associated values are of no interest other than
them not being empty so that set membership can be determined by passing the
cadidate value to dict_get_simple
(or dict_get
) and testing the result to
see if it is not empty. The entry values can all be the same - preferably a
short value such as '_'.
As with the vector emulation case, functions can be written that wrap the underlying dict operations. For example a predicate function to check if a set-dict contains a value might look as follows:
set_contains() {
local set="${1}" # the haystack to search
local value="${2}" # the needle to find
local maybe_in="$(dict_get_simple \
"${set}" "${value}")"
if [ -n "${maybe_in}" ]; then
true; return
else
false; return
fi
}
Conclusions and possible future directions
The dict.sh Shell Command Language library script implements a
dictionary container using mostly just standard features of the POSIX
Shell Command Language plus simple use of a few core utilities, with the
exception of local
and echo -n
.
However, performance is a concern. The call-out to sed
to perform the
required substitutions when inserting or retrieving nested dicts is
particularly heavy on performance.
The use of Command Substitution is also an area that drags down
performance. In fact internally the private, uglyfied support functions are
now all called directly and return their results by setting a
__dict_return_value__
global variable.
I have a couple of ideas to potentially address these two issues.
The first is to re-work how dicts are nested that removes the need for
changing the data format and thus removes the dependence on sed
. The idea
- very nebulous at the moment - would be to give each dict an identifier,
which might be random and may need to be globally unique, and bolt nested
dicts onto the end of the dict string with some sort of new separator
sequence - maybe using the currently unused FS character. Entries that have
nested dict values would have values that reference that dict by
identifier.
The second idea is to add a parallel set of API functions that are called
directly rather than by Command Substitution and return their value in a
variable specified by name by the caller as an additional parameter by making
used of the eval
built-in utility. These would be more cumbersome to use but
eleminate having to spin up a sub-shell process for every call.
However at the end of the day there is only going to be so much that can be done to reduce performance overheads. The underlying sequential nature of strings and the complexity required by the slice and splice operations will end up limiting performance. dicts are never going to have any great performance, especially at scale. However they were not intended to.
The reliance on echo -n
for the existing API functions can be fixed by using
standard facilties of the printf
utility.
Some additional operations would be useful. One that springs to mind is a
dict_merge
or dict_extend
operation that combines the elements of two
(or more?) dicts.
Rather than emulating other container types it would be cleaner to implement each as their own library script, tayloring the data structure and operations to suit each.
I do not know if or when I will make these changes as I had already gone down a few rabbit holes while getting on with a larger task when I started implementing dict. It would be good to bottom out and pop back up out of some of the rabbit holes.
References
[AsciiTable] ASCII Table https://www.asciitable.com/
[AnsiTermCodes] ANSI terminal control codes https://en.wikipedia.org/wiki/ANSI_escape_code
[EBNF] Extended Backus Naur Form https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[OpenGroup] The Open Group https://www.opengroup.org/
[ShellLang] Shell Command Language https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html
[ShellLangRationale] Rationale for Shell and Utilities, Shell Command Language, Shell Commands, Function Definition Command, p.2 https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html
[ShUtils] sh-utils repository https://github.com/ralph-mcardell/sh-utils