Aug 17 2020 / Michaël Hompus

YAML properties as Scalar or Sequence? Or both?!

YAML is a data serialization standard that is intended to be human friendly. For example, it reduces the use of delimiters quite drastically compared to other formats like JSON. Some YAML file authors might push the boundary for readability even further by having a property where the value can be a sequence with zero or more values, or just a scalar if there is only a single value. Reducing the number of delimiters someone has to read or write even further. How can this work with C#, a strongly typed language?

Full article…

YAML is a data serialization standard that is intended to be human friendly. For example, it reduces the use of delimiters quite drastically compared to other formats like JSON.

Some YAML file authors might push the boundary for readability even further by having a property where the value can be a sequence with zero or more values, or just a scalar if there is only a single value. Reducing the number of delimiters someone must read or write even further.

How can this work with C#, a strongly typed language?

A small introduction to YAML data structures

The YAML standard describes three basic primitives for data structures:

mappings (hashes/dictionaries)
sequences (arrays/lists)
scalars (strings/numbers)

In this article, we will focus only on sequences and scalars.

Scalars

To set the value of a property using a scalar, in this case a string, you write:

dependsOn: previousStage1

Sequences

There are two ways to write a sequence in a YAML file.

In block style, using a dash and space to form a bulleted list, putting every entry on a separate line:

dependsOn:
  - previousStage1
  - previousStage2

In flow style, using square brackets as delimiters and a comma as separator:

dependsOn: [previousStage1, previousStage2]

Both versions are holding the same data and are interchangeable. It is up to the author which style is preferred, based on readability and context.

Pushing readability even further

Some YAML file authors might push the boundary for readability even further. Take for example the YAML schema for Azure Pipelines. If we look at the Stage or Job structures, they both have a dependsOn property that can be a string, or a sequence of string.

stages:
- stage: string # Required as first property. ID of the stage.
  displayName: string # Human-readable name for the stage.
  dependsOn: string | [ string ] # Any stages which must complete before this one.
  …

jobs:
- job: string # Required as first property. ID of the job.
  displayName: string # Human-readable name for the job.
  dependsOn: string | [ string ] # Any jobs which must complete before this one.
  …

If the author of an Azure pipeline has only a single value for the dependsOn property, they do not need to add all the delimiters to the value. Stating only the single value as a string is good enough.

dependsOn: previousStage1

# equals
dependsOn:
- previousStage1

# equals
dependsOn: [ previousStage1 ]

Typed Languages

This is brief, and nice for readability. But if we want to parse this YAML document in a typed language like C#, we will get into trouble as there is no datatype that can be a single string and a sequence of string at the same time.

If we create a class that uses a string for storing the value, it will not work if we supply multiple values. But what if we use an enumerable data type? It is not a problem if we only store a single value in a list.

So, how can this work?

For the code, the YamlDotNet library is used again.

First, add a class to stand for the deserialized data, we will use the Azure pipeline Stage as an example.

public class Stage
{
  public IEnumerable<string> DependsOn { get; set; }
}

Create a deserializer, and feed it the sequence example mentioned earlier.

var deserializer = new DeserializerBuilder()
    .WithNamingConvention(CamelCaseNamingConvention.Instance)
    .Build();

var stage = deserializer.Deserialize<Stage>("dependsOn: [ previousStage1, previousStage2 ]");

// stage.DependsOn
// Count = 2
//     [0]: "previousStage1"
//     [1]: "previousStage2"

That is nice and easy.
What happens when we feed it with a single value?

var stage = deserializer.Deserialize<Stage>("dependsOn: previousStage1");

// YamlDotNet.Core.YamlException: '(Line: 1, Col: 12, Idx: 11) - (Line: 1, Col: 26, Idx: 25):
// Exception during deserialization
// Inner Exception
// 'InvalidCastException: Invalid cast from 'System.String' to 'System.Collections.Generic.List`1[[System.String, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]]'.

Okay, we need to help YamlDotNet a bit here, let us add some magic.

The YAML Type Converter

So, how to convince YamlDotNet that if an enumerable string is used as a data type and a single string value is stored in the YAML document, we want to add as the sole value to the list.

For this we add a converter class that inherits from the IYamlTypeConverter interface. Sadly, there is not much documentation about YAML Type Converters, but in summary, it uses events to read or write to a YAML stream.

Let us implement the interface and state in the Accepts method that we can handle enumerable string data types.

public class ScalarOrSequenceConverter : IYamlTypeConverter
{
  public bool Accepts(Type type)
  {
    return typeof(IEnumerable<string>).IsAssignableFrom(type);
  }

  public object ReadYaml(IParser parser, Type type)
  {
    throw new NotImplementedException();
  }

  public void WriteYaml(IEmitter emitter, object value, Type type)
  {
    throw new NotImplementedException();
  }
}

In the ReadYaml method we will implement the logic of reading from the YAML stream.

We first try if the current event is a Scalar event using the TryConsume method, if that is successful, we return a list with a single entry.

public object ReadYaml(IParser parser, Type type)
{
  if (parser.TryConsume<Scalar>(out var scalar))
  {
    return new List<string> { scalar.Value };
  }
}

If the source is a sequence, this will be visible by a SequenceStart event.
For every entry we will get a Scalar event.
And when there are no more entries, we handle the SequenceEnd event.
Now, we can return the list with all items.

if (parser.TryConsume<SequenceStart>(out var _))
{
  var items = new List<string>();

  while (parser.TryConsume<Scalar>(out var scalarItem))
  {
    items.Add(scalarItem.Value);
  }

  parser.Consume<SequenceEnd>();

  return items;
}

If the data was not a scalar or a sequence, we will return an empty list.

return Enumerable.Empty<string>();

We need to register this converter with the deserializer.

var deserializer = new DeserializerBuilder()
    .WithTypeConverter(new ScalarOrSequenceConverter())
    .WithNamingConvention(CamelCaseNamingConvention.Instance)
    .Build();

If we run the failed code again, we see it works:

var stage = deserializer.Deserialize<Stage>("dependsOn: previousStage1");

// stage.DependsOn
// Count = 1
//     [0]: "previousStage1"

Writing YAML

So, with the reading part implemented, could we also mimic the same behavior when writing YAML?

Start with the creation of a Serializer.

var serializer = new SerializerBuilder()
    .WithNamingConvention(CamelCaseNamingConvention.Instance)
    .Build();

Then, create an instance of the Stage class and give it a single value. This returns a sequence with a single item, not very surprising.

var stage = new Stage { DependsOn = new[] { "previousStage1" } };
serializer.Serialize(Console.Out, stage);

// dependsOn:
// - previousStage1

If we want to change the behavior of the serializer, we add the type converter to the configuration.

var serializer = new SerializerBuilder()
    .WithTypeConverter(new ScalarOrSequenceConverter())
    .WithNamingConvention(CamelCaseNamingConvention.Instance)
    .Build();

Add the logic to the WriteYaml method. We cast the object to an enumerable of string.
If it has exactly one item, we emit a Scalar event holding the value.
Otherwise, we will emit a SequenceStart event, a Scalar event for every item in the list and finish with a SequenceEnd event to close the sequence.

public void WriteYaml(IEmitter emitter, object value, Type type)
{
  var sequence = (IEnumerable<string>)value;
  if (sequence.Count() == 1)
  {
    emitter.Emit(new Scalar(default, sequence.First()));
  }
  else
  {
    emitter.Emit(new SequenceStart(default, default, false, SequenceStyle.Any));

    foreach (var item in sequence)
    {
      emitter.Emit(new Scalar(default, item));
    }

    emitter.Emit(new SequenceEnd());
  }
}

If we now repeat the earlier example, we see the output is as expected.

var stage = new Stage { DependsOn = new[] { "previousStage1" } };
serializer.Serialize(Console.Out, stage);

// dependsOn: previousStage1

And if we have multiple items, it is still written as a sequence.

var stage = new Stage { DependsOn = new[] { "previousStage1", "previousStage2" } };
serializer.Serialize(Console.Out, stage);

// dependsOn:
// - previousStage1
// - previousStage2

If the list is empty, the sequence will still have a start and end event.
An empty list is written in flow style.

var stage = new Stage { DependsOn = new string[0] };
serializer.Serialize(Console.Out, stage);

// dependsOn: []

Conclusion

It is possible to allow users more flexibility if you expect sequences often to contain a single value. And this can be done without losing the possibility to use typed languages like C# to parse it.

The code used in this article is shared on GitHub.