How to parse fixed-length data and why you should avoid 'String.substring'


Introduction

This article is about exchanging data in a fixed-length format. It covers the pros and cons of the format and demonstrates a Dart implementation that is easy to read and maintain. The implementation supports characters that consist of more than one code unit, such as emojis. It also shows that standard members like String.length and String.substring can give unexpected results on such characters.

Pros and Cons of the fixed-length data format

Data can be exchanged between systems in many different formats. The best-known format nowadays is JSON, but XML, CSV and fixed-length are also popular.

This article is about the fixed-length data format. It has some advantages over the other ones.

  • No need to load all data into memory before it can be used. Because every record has the same length, the data can be read and processed in chunks, which is especially useful when importing large datasets.
  • No need for escape characters, as there is with CSV files. With CSV, there is always a problem when the data contains the character that also separates the values.
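As a rough illustration of chunked processing, the sketch below turns a stream of byte chunks into a stream of records. It assumes one record per line and UTF-8 input; the function name readRecords and the in-memory chunks are made up for this example. In a real import the chunks would come from a file or socket.

```dart
import 'dart:convert';

/// Turns a stream of byte chunks into a stream of records, one per line.
/// Only the current chunk and the unfinished line are held in memory.
/// (readRecords is a hypothetical helper, not part of any library.)
Stream<String> readRecords(Stream<List<int>> bytes) =>
    utf8.decoder.bind(bytes).transform(const LineSplitter());

Future<void> main() async {
  // Hypothetical input: two chunks that cut the first record in half,
  // the way a file or socket read might.
  final chunks = Stream<List<int>>.fromIterable([
    utf8.encode('Sander    Roest     049Rotter'),
    utf8.encode('dam      The Netherlands     \n'
        'Sandra    Roest     042Rotterdam      The Netherlands     \n'),
  ]);

  // Each record is handled as soon as its line is complete.
  await for (final record in readRecords(chunks)) {
    print('[$record]');
  }
}
```

The records arrive whole even though the chunk boundary falls in the middle of a field.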

Of course, there are also downsides to using a fixed-length data format.

  • The sender and the receiver have to agree on the order of values and length of each value.
  • Each field has to be padded with trailing spaces or leading zeroes.
  • Each field is (obviously) fixed in length. An increase in length would need work on both the source and destination.

Data definition

It is important to document the format, so the source and destination are both aware of the format used. A simple document could look like this:

Field       Type     Length  Padding
first_name  char     10      Right with spaces
last_name   char     10      Right with spaces
age         integer  3       Left with zeroes
city        char     15      Right with spaces
country     char     20      Right with spaces
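The same definition can also be written down in code on the sending side. The sketch below is one possible way to do that, not a prescribed one: FieldSpec, definition and formatRecord are hypothetical names, and the padding assumes ASCII-only values because padLeft and padRight count code units.

```dart
/// Hypothetical description of one column in the data definition.
class FieldSpec {
  const FieldSpec(this.name, this.length, {this.numeric = false});

  final String name;
  final int length;
  final bool numeric; // numeric fields are left-padded with zeroes

  /// Pads [value] to [length]. Assumes ASCII-only input, because
  /// padLeft/padRight count UTF-16 code units, not characters.
  String format(String value) {
    if (value.length > length) {
      throw ArgumentError('$name is longer than $length: "$value"');
    }
    return numeric ? value.padLeft(length, '0') : value.padRight(length);
  }
}

// The five fields from the data definition, in order.
const definition = [
  FieldSpec('first_name', 10),
  FieldSpec('last_name', 10),
  FieldSpec('age', 3, numeric: true),
  FieldSpec('city', 15),
  FieldSpec('country', 20),
];

/// Joins the padded fields into one fixed-length record.
String formatRecord(List<String> values) {
  if (values.length != definition.length) {
    throw ArgumentError('Expected ${definition.length} values');
  }
  final buffer = StringBuffer();
  for (var i = 0; i < definition.length; i++) {
    buffer.write(definition[i].format(values[i]));
  }
  return buffer.toString();
}

void main() {
  final record =
      formatRecord(['Sander', 'Roest', '49', 'Rotterdam', 'The Netherlands']);
  print('[$record]');
  print(record.length); // 58
}
```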

Sample data

----------------------------------------------------------
1234567890123456789012312345678901234512345678901234567890
Sander    Roest     049Rotterdam      The Netherlands     
Sandra    Roest     042Rotterdam      The Netherlands     
Jeffrey   Roest     009Rotterdam      The Netherlands     
Lucas     Roest     007Rotterdam      The Netherlands     
----------------------------------------------------------

Implementation in Dart

Writing the code to parse fixed-length data seems like an easy job. At first, it looks like you just have to substring all the fields out of the data. This is in fact true, but it might get messy and difficult to maintain when the data definition changes.

Another issue to consider is that String.length and String.substring might not work the way you expect.

The String class works with UTF-16 code units. This means that String.length returns the number of code units, not the number of characters a user sees, and String.substring can cut a character such as an emoji in half.

You can read all about it in the excellent post "Dart string manipulation done right" 👉.
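A small sketch using only dart:core shows the difference between code units and code points. The two sample strings are chosen for illustration. With the characters package, characters.length should report 1 for each of them, since both are a single user-perceived character.

```dart
void main() {
  const medal = '\u{1F947}'; // 🥇 one code point, two UTF-16 code units
  const flag = '\u{1F1F3}\u{1F1F1}'; // 🇳🇱 two code points, four code units

  print(medal.length); // 2
  print(medal.runes.length); // 1
  print(flag.length); // 4
  print(flag.runes.length); // 2

  // Cutting after one code unit splits the surrogate pair and leaves
  // a lone high surrogate that cannot be displayed.
  final broken = medal.substring(0, 1);
  print(broken.codeUnitAt(0).toRadixString(16)); // d83e
}
```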

To overcome both problems, you can use this helper class:

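import 'package:characters/characters.dart';

/// Reads consecutive fixed-length fields from a single record.
class FixedLengthParser {
  // Characters iterates over user-perceived characters, not code units.
  FixedLengthParser(String value) : _characters = value.characters;

  final Characters _characters;
  var _index = 0; // offset of the next field, counted in characters

  /// Returns the next [length] characters with surrounding whitespace trimmed.
  String getByLength(int length) {
    final value = _characters.getRange(_index, _index + length);
    _index += length;
    return value.string.trim();
  }
}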

Using this class makes it easy to match the code with the data definition. If the length of a field changes, you only have to change it in one place.

final parser = FixedLengthParser(line);
final firstName = parser.getByLength(10);
final lastName = parser.getByLength(10);
final age = parser.getByLength(3);
final city = parser.getByLength(15);
final country = parser.getByLength(20);
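The parser returns every field as a string, so numeric fields still need converting. The following sketch wraps the calls above in a value class and parses the age with int.parse. The Person class is a hypothetical addition for illustration, and the block assumes the characters package is available as a dependency.

```dart
import 'package:characters/characters.dart';

// Same helper as above, repeated so the sketch is self-contained.
class FixedLengthParser {
  FixedLengthParser(String value) : _characters = value.characters;

  final Characters _characters;
  var _index = 0;

  String getByLength(int length) {
    final value = _characters.getRange(_index, _index + length);
    _index += length;
    return value.string.trim();
  }
}

/// Hypothetical value class for one record of the sample data.
class Person {
  Person(this.firstName, this.lastName, this.age, this.city, this.country);

  /// Reads the fields in the order of the data definition.
  /// Dart evaluates arguments left to right, so the order is preserved.
  factory Person.parse(String line) {
    final parser = FixedLengthParser(line);
    return Person(
      parser.getByLength(10),
      parser.getByLength(10),
      int.parse(parser.getByLength(3)), // '049' becomes 49
      parser.getByLength(15),
      parser.getByLength(20),
    );
  }

  final String firstName;
  final String lastName;
  final int age;
  final String city;
  final String country;
}

void main() {
  final person = Person.parse(
      'Sander    Roest     049Rotterdam      The Netherlands     ');
  print('${person.firstName} ${person.lastName}, ${person.age}');
  print('${person.city}, ${person.country}');
}
```

int.parse throws a FormatException on a field that is not a valid integer, which may or may not be the behaviour you want for bad input.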

Dartpad sample

With the dartpad sample you will be able to:

  • Test the FixedLengthParser class (renamed to FixedLengthParserCharacters)
  • Observe that String.substring fails on emojis
  • See a fun use of Dart's constructor tear-off feature to run the same code with a working (characters) and a failing (string) implementation.
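The tear-off idea can be sketched roughly as follows. This is not the code from the DartPad sample: all class and function names are hypothetical, and String.runes (code points) stands in for the characters package so the sketch needs no dependency. Unlike characters, code points still split emojis that consist of several code points, such as flags.

```dart
/// Common shape of both readers.
abstract class FieldReader {
  String getByLength(int length);
}

/// Counts UTF-16 code units, so it cuts emojis in half.
class CodeUnitReader implements FieldReader {
  CodeUnitReader(this._value);

  final String _value;
  var _index = 0;

  @override
  String getByLength(int length) {
    final value = _value.substring(_index, _index + length);
    _index += length;
    return value.trim();
  }
}

/// Counts code points via String.runes (a stand-in for characters).
class CodePointReader implements FieldReader {
  CodePointReader(String value) : _runes = value.runes.toList();

  final List<int> _runes;
  var _index = 0;

  @override
  String getByLength(int length) {
    final value = String.fromCharCodes(_runes.skip(_index).take(length));
    _index += length;
    return value.trim();
  }
}

/// A constructor tear-off is a function from String to FieldReader,
/// so the same parsing code can run with either implementation.
List<String> parse(String line, FieldReader Function(String) create) {
  final reader = create(line);
  return [reader.getByLength(2), reader.getByLength(2)];
}

void main() {
  const line = '\u{1F947}\u{1F947}\u{1F948}\u{1F948}'; // 🥇🥇🥈🥈
  print(parse(line, CodePointReader.new)); // [🥇🥇, 🥈🥈]
  print(parse(line, CodeUnitReader.new)); // [🥇, 🥇]
}
```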

https://dartpad.dev/?id=f4e6612c9f3ef3b5eb3e439923fe8a46

The output looks like this. Note that the String implementation fails on emojis.

Parse the data using characters:
-------------------------------------------------
Firstname (10): 1234567890
Lastname (10): 1234567890
Age (3): 123
City (15): 123456789012345
Country (20): 12345678901234567890

Firstname (6): Sander
Lastname (5): Roest
Age (3): 049
City (9): Rotterdam
Country (15): The Netherlands

Firstname (10): 🥇🥇🥇🥇🥇🥇🥇🥇🥇🥇
Lastname (10): 🥈🥈🥈🥈🥈🥈🥈🥈🥈🥈
Age (3): 🎂🎂🎂
City (15): 🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘🏘
Country (20): 🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱🇳🇱

Parse the data using string (faulty):
-------------------------------------
Firstname (10): 1234567890
Lastname (10): 1234567890
Age (3): 123
City (15): 123456789012345
Country (20): 12345678901234567890

Firstname (6): Sander
Lastname (5): Roest
Age (3): 049
City (9): Rotterdam
Country (15): The Netherlands

Firstname (5): πŸ₯‡πŸ₯‡πŸ₯‡πŸ₯‡πŸ₯‡
Lastname (5): πŸ₯‡πŸ₯‡πŸ₯‡πŸ₯‡πŸ₯‡
Age (2): πŸ₯ˆοΏ½
City (8): οΏ½πŸ₯ˆπŸ₯ˆπŸ₯ˆπŸ₯ˆπŸ₯ˆπŸ₯ˆπŸ₯ˆ
Country (10): πŸ₯ˆπŸŽ‚πŸŽ‚πŸŽ‚πŸ˜πŸ˜πŸ˜πŸ˜πŸ˜πŸ˜

Happy parsing!
