Tuesday, June 27, 2017

C++11 and some efficiency improvements


I recently learnt a few new things.

shared_ptr


The first is that using std::make_shared is not only good because of its better exception safety (the primary reason I hear people promoting it), it is in fact also more efficient and saves space.

To understand this, first let's explain what a shared_ptr is. It is basically a wrapper around a raw pointer. What the wrapper provides is reference counting. So a shared_ptr is two things: the raw pointer and a reference count associated with that raw pointer. Actually this is a slight simplification: because of weak pointers, the reference count is really two counts, a weak count and a strong count, and both of these live in a structure called the control block for the shared_ptr, but conceptually this is the reference counting part of it (in fact there is a bit more than this too, including the deleter etc., which gets hidden via type erasure, but that is far more detail than is needed for this discussion). So a shared_ptr is two pointers: one to the object being managed, and another to the reference count (or control block). The reference count needs to be behind a pointer so that multiple shared_ptr instances can refer to and update the one shared reference count associated with the object. Because shared_ptr has a constructor that takes a raw pointer, when constructed that way there is only one choice: dynamically allocate the reference count. Hence two dynamic allocations with this strategy, one for the object that was allocated and passed in, and a second for the reference count. More allocations can mean memory fragmentation. What std::make_shared does is essentially create a struct of T and the reference count together and allocate that with one allocation. So one less dynamic memory allocation. It makes the deletion simpler too. And that is why using make_shared to create a shared_ptr is better.
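
To make the difference concrete, here is a minimal sketch (Widget is just a placeholder type) showing the two ways of constructing a shared_ptr:

#include <memory>

struct Widget { int value = 0; };

void example()
{
  // Two dynamic allocations: one for the Widget, and a second for the
  // control block holding the reference counts.
  std::shared_ptr<Widget> a(new Widget);

  // One dynamic allocation: the Widget and the control block are placed
  // in the same block of memory.
  auto b = std::make_shared<Widget>();
}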

Whilst thinking about this, I realised a few things too. Firstly, unique_ptr obviously doesn't require a reference count and therefore never requires a second dynamic memory allocation. I did always scratch my head as to why C++11 had make_shared but no make_unique (although it was added later, in C++14), but it kind of makes sense: it doesn't add quite as much. The only thing it adds is the exception safety, which to be honest I don't lose sleep over as much as the efficiency of these things in the common case compared with the exceptional cases. From this point of view, the provision of make_shared is far more valuable than adding make_unique. And thinking about it even more, whilst I do know that implementations really do perform this make_shared optimisation, another optimisation I'm not so sure about, which could be done, is that when make_shared is not used, that second dynamic memory allocation for the reference count / control block could be deferred until such time as there is a second reference. That would make the cost of shared_ptr practically equal to that of unique_ptr when they are used equivalently. You would obviously choose between them for semantic reasons, but then there would be little need to choose from a performance point of view.

So the main takeaway should be: always use make_shared to construct your shared_ptrs when you can. There is no extra dynamic memory allocation; the cost is that of allocating a minimal bit of extra housekeeping information together with your object (assuming your object is not of some trivial size like < 32 bytes; consider making a copy of the object instead when it is small, and obviously consider your individual situation when evaluating if shared_ptr is right for you). For that second dynamic memory allocation when it is needed, I have imagined there could possibly be schemes which do away with it using a more complex strategy. The idea I had was something like a doubly linked list between the active shared_ptr objects referring to a shared object. As a shared_ptr object goes out of scope, it removes itself from the doubly linked list, patching the others to keep it all chained together. The last one removed will know it is the last because its prev and next pointers will be null. I'm not sure if anyone has implemented a reference counting scheme doing this. It obviously gets more complex when dealing with weak and strong references, it is much more maintenance work making it more complex and more likely to break, and it also isn't so nice from a memory access pattern point of view, jumping about too much.

Final

The next thing I learnt recently is the C++11 keyword 'final'. Similar to Java, it is possible to mark implementations of interfaces as final. You can do this on the class or on a specific virtual member function. The semantics are that it cannot be overridden further down the inheritance hierarchy: when a class is marked final, it cannot be inherited from, and when a member function is marked final, a subclass cannot override it. This is great from a semantics point of view; you can protect an implementation from being overridden, as you are marking it as intended to be the last in that chain of inheritance, which lets you write it in a way that doesn't need to account for unintended uses. But really the thing that makes me interested about final isn't so much being able to express semantic intent in your implementation, whilst that is important and cool; what I like is that it allows the compiler to make the code more efficient, and that is actually how I came across this. I theorised that when you have virtual functions and a vtable, you wouldn't need to do a virtual dispatch if you had knowledge that, in the context of the pointer you have and the function being called, it wasn't further derived. From the realisation that a final keyword would let you make things more efficient, I searched about this and found that not only did C++11 add it, but implementations are already doing this optimisation and avoid the virtual dispatch when final is used and your pointer is of the type where final was declared.
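
As a minimal sketch (Shape and Circle are made-up names), marking the leaf class final gives the compiler the information it needs to skip the vtable:

struct Shape
{
  virtual ~Shape() = default;
  virtual double area() const = 0;
};

// final: nothing can derive from Circle, so area() cannot be overridden further.
struct Circle final : Shape
{
  explicit Circle(double a_r) : m_r(a_r) {}
  double area() const override { return 3.14159 * m_r * m_r; }
  double m_r;
};

double circleArea(const Circle& a_c)
{
  // The static type is Circle and Circle is final, so the compiler may
  // devirtualise this call and invoke Circle::area directly.
  return a_c.area();
}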

What does it mean from a practical point of view? Well, it may change the way you structure and write code involving virtuals. Firstly you will want to start decorating your classes with final; that is the first step. Next you will want to defer casting to a base pointer as late as possible. And even after you have done so, if you are doing multiple dispatches with a given object, you might find it faster to dynamically cast the object back to the derived class it really is and make the multiple virtual calls through that pointer rather than through the base pointer, to speed up special fast-path cases (sketched below). I don't think you could have sped up a fast path like this as easily without removing the general purpose virtual functions or without more intrusive changes.
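
A hedged sketch of that fast path, continuing the Shape/Circle example above (use() just stands in for whatever work is being done):

void use(double);

void process(Shape* a_shape)
{
  // Fast path: if the object really is a Circle (a final class), make the
  // repeated calls through the Circle*, which can be devirtualised.
  if (Circle* c = dynamic_cast<Circle*>(a_shape))
  {
    for (int i = 0; i < 1000; ++i)
      use(c->area());
  }
  else
  {
    // General path: virtual dispatch on every call.
    for (int i = 0; i < 1000; ++i)
      use(a_shape->area());
  }
}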

String literals

Something that annoys me about C/C++ is that there is no distinction in the type system between literal objects and other objects. For example, take string literals such as "blah" in the code: this is a string of bytes, 'b', 'l', 'a', 'h', 0, which is const, lives in the statically initialised section of the program and exists for the duration of the execution. But generally its type is "const char *". If I new an array of chars, I'll get a "char *" which is readily convertible to a "const char *". That array is mutable and exists only as long as I say it exists. It is quite a different beast to a string literal. What I really want is to be able to provide APIs which can distinguish between these and do something different depending on how they were called, either with an immutable string literal or with a mutable array of chars. Sometimes you want to know that it is a string literal so you don't need to make a copy and can make an optimisation. Other times it can be for better security, so you have tighter control over what can be passed in to a function and from where.

So how can it be done? It can be done using templates; the only downside is that generally this will expose the implementation in the header, although this could just be used as a shim.

For example:

#include <cstddef>

template <size_t N>
void DoSomethingWithLiteral(const char (&a_str)[N])
{
  // a_str is a char array of known size N, such as a string literal
}

Basically that is it. A function that looks like that will only accept string literals (or, strictly, char arrays). Beware when providing other functions with the same name that overload resolution doesn't send a call made with a string literal to the other overload instead of the intended one. Obviously always test, and more template magic can be used to disambiguate the calls; that is doable, as sketched below.
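
Here is one hedged way to disambiguate (DoSomething and its overloads are hypothetical): by making the pointer version a template too, constrained to pointers, partial ordering prefers the array overload for literals rather than a non-template pointer overload stealing the call:

#include <cstddef>
#include <type_traits>

// Accepts char arrays (string literals) only.
template <size_t N>
void DoSomething(const char (&a_str)[N])
{
  // a_str has a size known at compile time
}

// Constrained so it only accepts char pointers; for a literal, partial
// ordering prefers the array overload above.
template <typename T,
          typename std::enable_if<std::is_same<T, const char*>::value ||
                                  std::is_same<T, char*>::value, int>::type = 0>
void DoSomething(T a_str)
{
  // a_str is a runtime pointer of unknown length
}

void demo()
{
  DoSomething("blah");      // goes to the array overload
  const char* p = "blah";
  DoSomething(p);           // goes to the pointer overload
}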

A few more things about string literals.

String literal operators. New in C++11 are user defined literal operators. This is neat and basically allows for attaching units to scalar values in a natural way. In English, when talking about values in certain units, the way it is expressed is something like 50kmh, or 10kg, or 5s or 10 days: a scalar value (e.g. an integer) followed by a suffix which is an abbreviation for the units. In traditional C++ the typical way would be to make a type, for example a type called kilograms or kg, e.g. class kg {}; with a constructor that took an int (or float), so you could express it as kg(50), do "kg(50) + kg(10);" etc., and provide conversion operators from the type kg to the type g so you could do "kg(1) + g(500)". But this is not so natural. A literal operator essentially lets you define a conversion from the literal to your type expressed as a suffix (user defined suffixes must begin with an underscore, so it looks like 1_kg or 500_g), and you can write instead "1_kg + 500_g". It kind of reminds me a bit of Python, where you can express things like "some string".split() and call methods on literals. And in C++ it is not limited to ints; you can do this for string literals too, so you could convert a string literal to a std::string for example with something like "blah"_s.
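
As a rough sketch (Mass and the suffixes _kg, _g and _s are all made up here, not standard ones):

#include <cstddef>
#include <string>

// A hypothetical mass type, with grams as the base unit.
struct Mass { long long grams; };

constexpr Mass operator"" _kg(unsigned long long v) { return Mass{static_cast<long long>(v) * 1000}; }
constexpr Mass operator"" _g(unsigned long long v) { return Mass{static_cast<long long>(v)}; }
constexpr Mass operator+(Mass a, Mass b) { return Mass{a.grams + b.grams}; }

// String literal operator: "blah"_s becomes a std::string.
std::string operator"" _s(const char* str, std::size_t len) { return std::string(str, len); }

void demo()
{
  Mass m = 1_kg + 500_g;     // m.grams == 1500
  std::string s = "blah"_s;  // a std::string rather than a const char*
}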

One more thing about string literals.

I wrote about logging not long ago here: Avoiding heap allocation with strings

Something I realised I didn't mention is that you can forward the arguments using templates better than what I demonstrated. The use of va_list isn't needed if you pass via templates. The issue with that, though, is that you then need to expose everything in the header file, so the demonstrated method is fine if you want to hide the implementation details.

Here is a slightly improved way:

#include <cstddef>
#include <cstdio>

template <size_t N, typename ...Ts>
void Logger(enum LogLevel a_level, const char (&a_fmt)[N], Ts&& ...a_args)
{
  // First call computes the required buffer size, second call formats into it.
  size_t msgSiz = std::snprintf(nullptr, 0, a_fmt, a_args...) + 1;
  char msg[msgSiz];
  ::snprintf(msg, msgSiz, a_fmt, a_args...);
  // ...
}

You'll notice it is slightly fewer lines of code than the original. It forwards the arguments, and does so without all the fuss of using va_lists. Another subtle change is the way in which a_fmt is passed in: instead of being passed as a const char*, it is passed and preserved as a reference to the char array, which is what a string literal is. This has some benefits for security and efficiency. In principle the compiler has everything it needs at compile time to check that the arguments match the format provided, and because a_fmt must bind to a char array rather than an arbitrary pointer, it is much harder for a maliciously formed a_fmt string to be passed in at runtime.

It may not be appropriate in the above case, but in some cases it is possible to declare a template in the header without providing the body, and define the body in the source file together with explicit instantiations for the anticipated types (a sketch follows below). In the above case there are too many varying parameters to the template to be able to instantiate all the required uses of it in a single cpp file. But there are cases where each expansion of the template for each type can be anticipated, so the body can be hidden away in a cpp file. This is generally preferable when it can be done, which leads to the next thing I learnt recently, and it is a good thing to know.
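
For instance, something like this (PrintWidget and the file names are hypothetical):

// widget_printer.h -- only the declaration is visible to clients
template <typename T>
void PrintWidget(const T& a_widget);

// widget_printer.cpp -- the body lives here, out of the header
#include "widget_printer.h"
#include <iostream>

template <typename T>
void PrintWidget(const T& a_widget)
{
  std::cout << a_widget << '\n';
}

// Explicitly instantiate the anticipated uses so the linker can find them.
template void PrintWidget<int>(const int&);
template void PrintWidget<double>(const double&);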

Headers and forward declarations


I was reading "Effective C++" by Scott Meyers. The books are over 20 years old, but there is still value to be gained from them. Obviously some advice is dated and any older text book needs to be read in context, but sometimes there are things you just didn't realise and somehow missed. In fact it seems a lot of people have missed this particular point, which I will elaborate on, as I have rarely seen it practised, if ever.

It involves the forward declaration of types, and then defining functions in terms of them not just by pointer or reference, but also by value.

The example from the book goes something like this:

#pragma once     // Perhaps another blog post about using pragma once ?
#ifndef BIRTHDAY_H
#define BIRTHDAY_H

class Date;  // this is the forward declaration

Date getMyBirthday(); // return the date of my birthday
void setMyBirthday(Date a_date); // set the date of my birthday

#endif // BIRTHDAY_H


I didn't realise this was possible, but I tried it and it is indeed possible. I don't know why, but I thought that the Date used in the function prototypes would at least need to be a reference or pointer when you don't yet know the details of what the Date class is. But it is great that this actually works.

It works if you think about it, because you only need to know the details of what Date is when you go to implement those functions or when you go to call them, so you can defer inclusion of date.h until then rather than forcing it into the header.

I think that was my confusion: if I didn't include date.h inside birthday.h, then I could neither implement birthday.cpp nor use the functions in birthday.h that depended on Date (say, in main.cpp for the sake of simplicity), so I logically assumed the fix was to include date.h inside birthday.h. But that is the fallacy; it isn't the fix you are looking for. The fix you really want is to include date.h in both birthday.cpp and main.cpp. This seems to contravene the DRY principle (don't repeat yourself), but it really is better. As much as possible, you want to avoid including headers in your headers; it only leads to more headers being included, and so on, which leads to longer compile times. A chain of headers including headers means that thousands of function prototypes are pulled in, when in fact a given cpp file may only use hundreds or fewer functions external to itself. Being explicit inside the cpp files about what is included is preferable to implicitly including much more via headers including other headers.
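
Continuing the book's example, and assuming date.h defines a default constructible Date class, the includes end up in the cpp files like this:

// birthday.cpp -- the definitions need the full Date type, so include date.h here
#include "birthday.h"
#include "date.h"

static Date s_myBirthday;

Date getMyBirthday() { return s_myBirthday; }
void setMyBirthday(Date a_date) { s_myBirthday = a_date; }

// main.cpp -- the caller includes date.h too, rather than birthday.h dragging it in
#include "birthday.h"
#include "date.h"

int main()
{
  setMyBirthday(Date());
  Date d = getMyBirthday();
  (void)d;
}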

I think this is something I am going to try to adopt more if I can.

Some gotchas and possible solutions: where you do need to know the details of an object, and its size, is when you use it as a member variable, e.g.:

class Date;
class Birthday {
public:
  int dayOfMonth();
  int month();
private:
  Date date;
  // ... other members
};

Now this won't work. You can't figure out sizeof(Birthday) without knowing sizeof(Date), and you can't do "new Birthday". This won't compile as is.

If you really don't want to include date.h, or if Date were a more heavyweight object, a workaround is to wrap it in a shared_ptr or unique_ptr, which goes back to one of the original points in this blog about the efficiency of doing so.
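
A minimal sketch of that workaround: note the constructor and destructor must be defined in the cpp file, where Date is a complete type, because the unique_ptr's deleter needs the full definition of Date to delete it.

#pragma once

#include <memory>

class Date;

class Birthday {
public:
  Birthday();    // defined in the cpp file, where Date is complete
  ~Birthday();   // must also live in the cpp file for the unique_ptr deleter
  int dayOfMonth();
  int month();
private:
  std::unique_ptr<Date> m_date;
};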

Anyway, exposing all the private member variables in a class is not great if you can avoid it, and if you do go the route of wrapping any of the members in unique_ptr or shared_ptr, why not combine all the private members together into a struct and wrap that entire struct in a smart pointer? This is essentially what is called a pimpl. Once you do that, you can hide all of that implementation detail in the cpp file. It looks like this in the header:


#pragma once

#include <memory>

class Birthday {
public:
  int dayOfMonth();
  int month();
private:
  struct Pimpl;
  std::shared_ptr<Pimpl> m_pimpl;
};


In the cpp file you can then do whatever you like:


#include "Birthday.h"

struct Birthday::Pimpl
{
   // put anything you like in here
   // change it as often and as much as you like
};


Make sure the Birthday constructor (not shown) uses make_shared for efficiency!
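
For instance, assuming a default constructor is declared in the header and this is placed after the definition of Birthday::Pimpl in the cpp file, it might look something like:

Birthday::Birthday()
  : m_pimpl(std::make_shared<Pimpl>())  // Pimpl and its control block in one allocation
{
}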

I might write more in another blog about this pattern of pimpl in a shared_ptr.


Wednesday, June 7, 2017

Array of Structures and Structures of Arrays


It can be time consuming and painful to recode the data structures in a program to improve efficiency by optimising the data layout so that data which will be accessed together is grouped together.

The goal is to improve cache utilisation. You don't want to iterate an array of structures where you are only accessing some of the members of each item. Say each item in the array is a struct 32 bytes in size, but we only access 4 bytes of each one as we loop. We have loaded up our cache lines with loads of data we aren't even going to touch, resulting in sub-optimal efficiency. An alternative approach in this scenario is to rearrange the data into a structure holding an array for each member. When accessing one member at a time while iterating, the cache will be used more efficiently, as the cache lines will be filled with only the data we are accessing. Alternatively, rewriting the code to work on each item as a whole might be an option, but that might not be possible, or might itself be sub-optimal.
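
To make that concrete, here is a hypothetical sketch matching the 32-byte example (Particle and its members are made up):

// Array of structures: each Particle is 32 bytes, but the loop below
// only touches the 4 bytes of m_life in every item it pulls into cache.
struct Particle
{
  float m_pos[3];
  float m_vel[3];
  float m_mass;
  float m_life;
};
Particle aos[1000];

// Structure of arrays: each member gets its own contiguous array.
struct Particles
{
  float m_pos[1000][3];
  float m_vel[1000][3];
  float m_mass[1000];
  float m_life[1000];
};
Particles soa;

void age(float a_dt)
{
  // AoS: 4 useful bytes out of every 32 loaded into the cache.
  for (int i = 0; i < 1000; ++i)
    aos[i].m_life -= a_dt;

  // SoA: the m_life values are packed together, so cache lines are fully used.
  for (int i = 0; i < 1000; ++i)
    soa.m_life[i] -= a_dt;
}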

Conversely, if the data is arranged as structures of arrays and multiple members are accessed at the same time when iterating, there can be related problems, though that is less likely. You still may want to try both approaches to see what works best for given data structures and their access patterns.

But it's not trivial in C/C++ to rearrange the data like this without having to rewrite the code which deals with it. You can't simply flick a switch and have the compiler automagically figure out how to rearrange the code generation to access the data in the alternative layout. Being able to do that would be pretty cool. Jonathan Blow claims his Jai programming language has the ability to do just that. Currently Jai isn't released, so all we know is from presentations of the language, but it appears to work by decorating the declaration of the array with something which says how it will be laid out, and the language generates the appropriate code without the programmer needing to change the code which accesses it.

Now I would like to be able to do something similar in C/C++, so I had a bit of a think about it and have come up with this basic idea on how it might be possible to achieve something that comes close. Unfortunately to make it work requires some macros and templates, but nothing too fancy.


Here is the basic idea we want to arrive at:


template <size_t SIZE>
struct SomeStructure
{
  int   m_someMember[SIZE];
  float m_someOtherMember[SIZE];
  // you could put anything in here, each member just needs "[SIZE]" added to the end
};



Then we can create arrays either way like this:



// As an array of structures
SomeStructure<1> AoS_array[1000];

// As a structure of arrays
SomeStructure<1000> SoA_array[1];



When we want to access them we do one of these:



int x = AoS_array[n].m_someMember[0];

int x = SoA_array[0].m_someMember[n];


Notice we need to swap the indexes. But we can do that easily with a macro if we know which type of array it is. One simple way I came up with was to make a macro for declaring an array which also defines a const bool that the accessing macro can use.

So here are the macros:


#define DECLARE_STRUCT_BEGIN(StructName) \
  template <size_t SIZE> \
  struct StructName \
  {

#define MEMBER(t, name) \
    t name[SIZE];

#define DECLARE_STRUCT_END(StructName) \
  };

#define MAKE_AOS(array, S, SIZE) \
     S<1> array[SIZE]; \
     const bool array##_isSOA = false;

#define MAKE_SOA(array, S, SIZE) \
     S<SIZE> array[1]; \
     const bool array##_isSOA = true;

#define GET(array, a, n) \
     ((array##_isSOA) ? array[0].a[n] : array[n].a[0])


Probably this could be tweaked and improved, but it gets the idea across as to how it could be possible. I think it is fairly self explanatory.

Usage example:


DECLARE_STRUCT_BEGIN(S)
  MEMBER(int, a)
  MEMBER(int, b)
  MEMBER(int, c)
DECLARE_STRUCT_END(S)


MAKE_AOS(array1, S, 100)

MAKE_SOA(array3, S, 100)


void test()
{
  MAKE_AOS(array2, S, 100)
  
  for (int i = 0; i < 100; i++)
  {
    GET(array1, a, i) = i;
    GET(array2, a, i) = i;
    GET(array3, a, i) = i;
  }
}


I think I may be able to improve on this with templates instead of macros to make the syntax more appealing, but just having something that works is a start, and it makes it a cinch to switch data structures between AoS and SoA to try out different layouts without a mammoth amount of work.

Also, referring to my other post about serialisation and de-serialisation (http://inspired-code.blogspot.com.au/p/automating-object-serializationdeserial.html), where I presented DECLARE_CLASS and DECLARE_MEMBER macros to automagically generate the code for serialising and de-serialising a structure: it shouldn't be too hard to combine the macros and so combine the two ideas, declaring a structure/class that generates the serialisation and deserialisation and is also easy to switch between SoA and AoS. That is another direction this could be taken.


Along the lines of using templates to reduce the use of macros, this is possibly one direction to take for declaring an SoA or AoS and accessing either consistently without relying on the preprocessor:



#include <cstddef>
#include <functional>
#include <type_traits>

template <typename T>
struct ArrayAccessor
{
  explicit ArrayAccessor(std::function<T&(std::size_t)> func) : m_getter(func) {}
  inline T& operator[](std::size_t idx) {
    return m_getter(idx);
  }
  std::function<T&(std::size_t)> m_getter;
};


template <template<size_t> class t, size_t SIZE>
struct SOA
{
  template <typename T>
  ArrayAccessor<T> get(T (t<SIZE>::*memberPtr)[SIZE]) {
    // note: the lambda captures 'this', so the accessor must not outlive the array
    return ArrayAccessor<T>([this, memberPtr](size_t idx)->T& { return (m_array[0].*memberPtr)[idx]; });
  }
  t<SIZE> m_array[1];
};


template <template<size_t> class t, size_t SIZE>
struct AOS
{
  template <typename T>
  ArrayAccessor<T> get(T (t<1>::*memberPtr)[1]) {
    // note: the lambda captures 'this', so the accessor must not outlive the array
    return ArrayAccessor<T>([this, memberPtr](size_t idx)->T& { return (m_array[idx].*memberPtr)[0]; });
  }
  t<1> m_array[SIZE];
};



Then to declare an array as SOA or AOS you would do:

SOA<S,1000> array4;
AOS<S,1000> array5;

Access can be made the same:



#define ARRAY_MEMBER(array, member) \
    array.get(&std::remove_reference<decltype(array.m_array[0])>::type::member)


Usage:


    ARRAY_MEMBER(array4, a)[i] = i;
    ARRAY_MEMBER(array5, a)[i] = i;


Ideally the templates could work out the type without needing so much typing or macros, but that is left as an exercise for the reader.

Tuesday, May 30, 2017

Avoiding heap allocation with strings


I was wanting to debug my memory allocations and hit a problem. The logging code that prints out useful information to help debug and trace the program is itself allocating memory. This is particularly pronounced with the string manipulations it does and with creating formatted strings from a variable list of arguments.

It got me thinking. Firstly, when one has a std::string object, even if that object itself is on the stack, its buffer is still allocated off the heap (small string optimisations aside). It is too bad there is no way to allocate the buffer off the stack, given the lifetime of the object is tied to the stack anyway. This is just how it is in C/C++. You can sort of do variable sized allocations off the stack, with alloca or a variable length array, such as:

   size_t var = i * 10;
   char array[var];

Depending on the compiler, that can work (variable length arrays are a C99 feature that some C++ compilers support as an extension). But anyway, this is no good for a general kind of string class, as the lifetime is the scope it is created in, so it won't persist when the function returns. The same applies to alloca.

But I really want to solve this and completely avoid making heap allocations while logging. The main culprit is formatting a string from a format argument and a variable list of arguments.


So the first thing to do is work out the required size for this variable sized array to format in to.


#include <cstdarg>
#include <cstddef>
#include <cstdio>

void Logger(enum LogLevel a_level, const char* a_fmt, va_list a_args)
{
  // The first vsnprintf consumes a_args, so take a copy for the second call.
  va_list args2;
  va_copy(args2, a_args);
  size_t msgSiz = std::vsnprintf(nullptr, 0, a_fmt, a_args) + 1;
  char msg[msgSiz];
  ::vsnprintf(msg, msgSiz, a_fmt, args2);

  va_end(args2);
  // ...
}


Now msg contains the formatted string, without using the heap. The only tricky thing is having to make a copy of the va_list, because it gets consumed in the first call to vsnprintf and we need a fresh copy to pass to the second call.

That wasn't too hard :)