Groups Parameter of the Convolution Layer
One of the parameters of the convolution layer in PyTorch is groups. It controls which input channels are connected to which output channels. In this post, I walk through several scenarios for the groups parameter to show how it behaves.
Groups = 1
This is the default setting. Under it, every input channel contributes to every output channel. Suppose you have three input channels, as in a color image, and three output channels. If you want each output channel to be a function of all the input channels, use the default value of the groups parameter. With a square mask of size 5×5, the layer is specified as follows, with every unspecified parameter left at its default:
import torch.nn as nn

con = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, bias=False)
print(con.weight.shape)
The above specification prints torch.Size([3, 3, 5, 5]) as the shape of the convolution masks: 3 output channels × 3 input channels × 5 × 5, or 225 weight values in total. The pictorial representation given below captures how input and output are related.
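As a quick check of the default setting, the sketch below counts the weights and runs one forward pass. The 32×32 input size is an assumption chosen only for illustration.

```python
import torch
import torch.nn as nn

# Default groups=1: every output channel sees every input channel.
con = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, bias=False)

# One 5x5 mask per (output channel, input channel) pair.
n_weights = con.weight.numel()
print(con.weight.shape, n_weights)

# Hypothetical batch holding a single 32x32 three-channel image.
x = torch.randn(1, 3, 32, 32)
y = con(x)

# No padding is used, so each spatial side shrinks by kernel_size - 1.
print(y.shape)
```

The weight count should match the 3x3x5x5 product worked out above.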
Groups ≠ 1 and In_channels > Out_channels
Whenever a groups value other than the default of 1 is selected, it must be an integer that divides both the number of input channels and the number of output channels. A non-default groups value creates multiple paths, where each path connects only a subset of the input channels to a subset of the output channels. As an example, suppose we have 8 channels coming out of an intermediate convolution layer and we want to convolve them in groups to produce four output channels. In this case, non-default values of 2 and 4 are possible for the groups parameter. Let’s see how the grouping of channels takes place with groups = 4. In this case our convolution layer specification will be:
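con = nn.Conv2d(in_channels=8, out_channels=4, groups=4, kernel_size=5, bias=True)
print(con.weight.shape)
print(con.bias.shape)
The weight.shape in this case turns out to be torch.Size([4, 2, 5, 5]), meaning four filters with two inputs each. The bias is enabled here so that its shape can be inspected: it is torch.Size([4]), one value per output channel. (With bias = False, con.bias is None and has no shape to print.) We can visualize the convolution paths as shown below: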
However, if we set groups to 2, we find that weight.shape is torch.Size([4, 4, 5, 5]), meaning four filters with four inputs each. Thus, we can set the groups parameter to a value that aligns with the paths that we want to create.
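The divisibility rule for groups can be checked with a few lines of plain Python. This is only a sketch: valid_groups is a hypothetical helper introduced here, not a PyTorch function.

```python
def valid_groups(in_channels, out_channels):
    """Return every groups value that divides both channel counts."""
    limit = min(in_channels, out_channels)
    return [
        g for g in range(1, limit + 1)
        if in_channels % g == 0 and out_channels % g == 0
    ]

# The 8-input, 4-output example from the text.
options = valid_groups(8, 4)
print(options)  # [1, 2, 4]
```

For 8 input and 4 output channels this lists the default value 1 together with the two non-default choices discussed above.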
Groups ≠ 1 and In_channels < Out_channels
Now let’s see how the groups value affects the resulting convolution paths when the number of output channels is greater than the number of input channels. Let’s assume we have 4 input channels and 8 output channels. With groups = 4, we find the weight shape to be torch.Size([8, 1, 5, 5]). This choice of groups results in 8 filters, each with only one input channel. On the other hand, a groups value of 2 results in 8 filters, each convolving 2 input channels.
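The two cases above can be reproduced with a short loop. This is a minimal sketch that reuses the 5×5 mask size from the earlier examples.

```python
import torch.nn as nn

# 4 input channels, 8 output channels, two valid groups values.
shapes = {}
for g in (2, 4):
    con = nn.Conv2d(in_channels=4, out_channels=8, groups=g,
                    kernel_size=5, bias=False)
    shapes[g] = tuple(con.weight.shape)
    print(f"groups={g}: weight shape {shapes[g]}")
```

The second entry of each shape is the number of input channels seen by one filter.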
Groups ≠ 1 and In_channels = Out_channels
When the numbers of input and output channels are the same, and the groups parameter is set to the number of channels, each input channel is convolved separately to produce its corresponding output channel. This means a direct one-to-one connection is made between each input-output channel pair. When any other valid groups value is used, that value specifies the number of separate paths, and the number of input channels convolved together along each path is the number of input channels divided by the groups value. Thus, with in_channels = out_channels = 4 and groups = 2, we will have 4 filters with two input channels being convolved per filter.
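The one-to-one connection can be observed directly: change a single input channel and see which output channels react. The sketch below assumes a 16×16 input and sets all weights to a fixed value so the result does not depend on random initialization.

```python
import torch
import torch.nn as nn

con = nn.Conv2d(in_channels=4, out_channels=4, groups=4,
                kernel_size=5, bias=False)

# Fixed weights (an assumption for reproducibility, not a training setup).
with torch.no_grad():
    con.weight.fill_(0.1)

x = torch.randn(1, 4, 16, 16)
x_perturbed = x.clone()
x_perturbed[:, 0] += 1.0  # modify input channel 0 only

with torch.no_grad():
    # Largest absolute change seen in each output channel.
    diff = (con(x_perturbed) - con(x)).abs().amax(dim=(0, 2, 3))

# Small tolerance guards against floating-point noise.
changed = (diff > 1e-4).tolist()
print(changed)
```

Only the first output channel should be flagged as changed, since it is the only one connected to input channel 0.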
To summarize from above, we see that the ratio of number of input channels to the groups value determines the number of input channels that will be grouped per filter. Of course, the number of filters equals the number of output channels.
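This summary amounts to a simple shape formula, sketched below in plain Python. The helper name weight_shape is hypothetical, and the formula is checked only against the shapes quoted in this post.

```python
def weight_shape(in_channels, out_channels, groups, kernel_size):
    """Predict the Conv2d weight shape for a square mask."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide in_channels and out_channels")
    # Filters = out_channels; inputs per filter = in_channels / groups.
    return (out_channels, in_channels // groups, kernel_size, kernel_size)

# (in_channels, out_channels, groups) for every case discussed above.
cases = [(3, 3, 1), (8, 4, 4), (8, 4, 2), (4, 8, 4), (4, 8, 2),
         (4, 4, 4), (4, 4, 2)]
predicted = {case: weight_shape(*case, kernel_size=5) for case in cases}
for case, shape in predicted.items():
    print(case, "->", shape)
```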
Convolutions performed with a non-default groups value are called grouped convolutions. They offer two advantages. First, because grouped convolutions create multiple independent paths, the paths can be processed on multiple GPUs in parallel during training, which makes training more efficient. Second, grouping reduces the number of parameters: each filter sees only in_channels / groups input channels, so the weight count falls by a factor equal to the groups value compared with groups = 1. You may want to read about how these convolutions have been used in the AlexNet and ResNeXt architectures.
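The parameter saving is easy to measure. The sketch below reuses the 8-input, 4-output layer from earlier in the post and counts its weights for each valid groups value. count_weights is a hypothetical helper name.

```python
import torch.nn as nn

def count_weights(groups):
    """Count the trainable values in an 8-to-4 channel 5x5 convolution."""
    con = nn.Conv2d(in_channels=8, out_channels=4, groups=groups,
                    kernel_size=5, bias=False)
    return sum(p.numel() for p in con.parameters())

counts = {g: count_weights(g) for g in (1, 2, 4)}
print(counts)  # {1: 800, 2: 400, 4: 200}
```

Doubling the groups value halves the weight count, which matches the factor-of-groups reduction described above.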