https://blog.openai.com/blocksparsegpukernels/

#1 chroem:
The thing is, it's a common misconception that neural networks are somehow intrinsically tied to linear algebra. Matrix multiplications are just a convenient way to build functions with lots of tunable degrees of freedom. Just as in finite element simulations, sparse matrices tend to point to the underlying problem being more graph-based in nature (each nonzero entry is effectively a weighted edge between two nodes). While I have no way to prove this, I've strongly suspected for a while that most of the weights in dense-matrix deep learning models don't actually have an effect on the output, and that we've been unnecessarily burning cycles computing their products. The trouble, of course, is figuring out which ones are useful and which ones aren't.
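
To make the graph reading concrete, here is a minimal Python sketch. It assumes nothing about the linked kernels: the function names (dense_matvec, to_edges, edge_matvec) are made up for illustration, and the tiny matrix is toy data. It only shows that a matrix-vector product over a sparse matrix gives the same answer as walking a list of weighted edges, while touching only the nonzero weights.

```python
def dense_matvec(W, x):
    """Ordinary dense product y = W x, multiplying every entry, zeros included."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def to_edges(W):
    """View the matrix as a graph: one (output, input, weight) edge per nonzero entry."""
    return [
        (i, j, w)
        for i, row in enumerate(W)
        for j, w in enumerate(row)
        if w != 0.0
    ]


def edge_matvec(edges, x, n_out):
    """Same product, computed by visiting edges only; zero weights cost nothing."""
    y = [0.0] * n_out
    for i, j, w in edges:
        y[i] += w * x[j]
    return y


if __name__ == "__main__":
    # Toy 3x3 layer with 3 nonzero weights out of 9 (hypothetical values).
    W = [
        [0.0, 2.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 3.0],
    ]
    x = [1.0, 2.0, 3.0]

    edges = to_edges(W)
    print(len(edges), "edges instead of", len(W) * len(W[0]), "multiplications")
    print(dense_matvec(W, x))               # [4.0, 1.0, 9.0]
    print(edge_matvec(edges, x, len(W)))    # [4.0, 1.0, 9.0]
```

The edge version does work proportional to the number of nonzero weights rather than to the full matrix size, which is where the savings would come from. It says nothing about the hard part, namely deciding which weights can be zeroed in the first place.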