LU-decomposition.
The block size is a tunable parameter; 16 and 32 are reasonable choices. The decomposition returns the L and U factors packed into a single square matrix.
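
As an illustration of the packed layout and of where the block size enters, here is a minimal NumPy sketch of a blocked, right-looking LU factorization. It is not the implementation documented here. The names `lu_blocked` and `unpack_lu` are hypothetical, and the sketch assumes no pivoting and a unit diagonal on L (so the diagonal of the packed matrix belongs to U); the actual routine may differ on both points.

```python
import numpy as np

def lu_blocked(a, block_size=32):
    """Blocked LU without pivoting (illustrative sketch, hypothetical name).

    Returns one square matrix holding L strictly below the diagonal
    (unit diagonal implied, an assumption) and U on and above it.
    """
    lu = np.array(a, dtype=float)  # work on a copy of the input
    n = lu.shape[0]
    if lu.shape != (n, n):
        raise ValueError("expected a square matrix")

    for k in range(0, n, block_size):
        e = min(k + block_size, n)  # end of the current diagonal block

        # 1. Unblocked factorization of the tall panel lu[k:, k:e].
        for j in range(k, e):
            lu[j + 1:, j] /= lu[j, j]  # multipliers (column j of L)
            lu[j + 1:, j + 1:e] -= np.outer(lu[j + 1:, j], lu[j, j + 1:e])

        if e < n:
            # 2. Row block of U: solve L11 @ U12 = A12.
            #    A general solver is used for brevity; a triangular
            #    solve would be the usual choice.
            l11 = np.tril(lu[k:e, k:e], -1) + np.eye(e - k)
            lu[k:e, e:] = np.linalg.solve(l11, lu[k:e, e:])

            # 3. Trailing update: A22 -= L21 @ U12.
            lu[e:, e:] -= lu[e:, k:e] @ lu[k:e, e:]

    return lu

def unpack_lu(lu):
    """Split the packed matrix into explicit L and U factors."""
    l = np.tril(lu, -1) + np.eye(lu.shape[0])
    u = np.triu(lu)
    return l, u
```

Because the sketch does not pivot, it is only suitable for matrices whose leading blocks are safely nonsingular, for example diagonally dominant ones. For such a matrix `a`, `l, u = unpack_lu(lu_blocked(a, block_size=16))` should satisfy `l @ u ≈ a` up to rounding, and the result should not depend on the block size beyond rounding differences.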