What's the general process by which a new operator is added to PyTorch? Why is this actually something of a rare occurrence? How do you integrate an operator with the rest of PyTorch's system so it can be run end-to-end? What should I expect if I'm writing a CPU and CUDA kernel? What tools are available to me to make the job easier? How can I debug my kernels? How do I test them?

Further reading.

The README for the native/ directory, where all kernels get put https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.mdA high level overview of how TensorIterator works https://labs.quansight.org/blog/2020/04/pytorch-tensoriterator-internals/Where OpInfos live https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py