There is a way to, at least theoretically, significantly reduce the computational overhead of pixel sorting.
First you would have to use a sorting algorithm that works with arbitrary input lengths. Radix sort might work (but I have never implemented it myself). Merge sort definitely works. The trick with merge sort on a GPU is that you can find the n-th element of the merged sequence using binary search.
Next you need to know for every pixel which index it has in a span and how long that span is (if it is not to be sorted, you could set the span length to 0). In order to achieve that you could first build a balanced binary tree of the pixels. Each node of the tree would have to store the following information:
* Number of spans
* Ends with pixel that needs sorting
* Starts with pixel that needs sorting
* Number of pixels to be sorted
You can than descend that tree and know which pixel is at which position in which span.