Fiddling around with back-of-envelope DSP code has given me a cost estimate for Quake-style texture mapping. Specifically: perspective-correct at every pixel drawn. The best BadMood could achieve was perspective correction per line or per column of pixels, which is why that engine can't look up/down 'properly', tilt the camera sideways, or render sloping floor surfaces. There was no chance of that happening there.
I now think it's practical to obtain the screenspace u,v texture coordinates from 3D surface coordinates in 16 instructions or fewer (needed at span edges), and a full texture mapping fragment in 29 instructions or fewer per pixel. That is before attempting any parallelizing optimizations, or other shortcuts such as calculating only every Nth pixel (which remains a valid option, though it's not yet clear whether it helps given other constraints).
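To make the per-pixel cost concrete, here is a minimal C sketch of what "perspective correct at every pixel" involves. It is not the DSP code being costed above, just the standard formulation: u/z, v/z and 1/z are linear in screen space, so they are stepped across the span and divided back out at each pixel. The `SpanGradients` struct, the function name, the use of floats and the power-of-two texture size are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical span description (not from the original notes).
   uoz = u/z, voz = v/z, ooz = 1/z at the left edge of the span,
   plus the per-pixel step for each. All three are linear in screen space. */
typedef struct {
    float uoz, voz, ooz;
    float duoz, dvoz, dooz;
} SpanGradients;

/* Draw 'count' pixels with a divide at every pixel.
   Assumes tex_w and tex_h are powers of two so wrapping is a simple mask. */
void draw_span_perspective(uint8_t *dst, int count, SpanGradients g,
                           const uint8_t *tex, int tex_w, int tex_h)
{
    for (int i = 0; i < count; i++) {
        float z = 1.0f / g.ooz;                 /* recover z for this pixel */
        int u = (int)(g.uoz * z) & (tex_w - 1); /* u = (u/z) * z            */
        int v = (int)(g.voz * z) & (tex_h - 1); /* v = (v/z) * z            */
        dst[i] = tex[v * tex_w + u];

        g.uoz += g.duoz;                        /* step the linear terms    */
        g.voz += g.dvoz;
        g.ooz += g.dooz;
    }
}
```

For a three-pixel span running from z=1, u=0 to z=2, u=4, this picks u = 0, 1, 4. Plain linear interpolation of u would pick 2 for the middle pixel, which is the distortion the per-pixel divide removes.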
By comparison, the optimized BadMood liquid shader required 22 instructions per pixel, and that was fast enough for shading a big area, at least in chunky column mode. The optimized direct texture & lighting version uses 8 per pixel, assuming constant z, and hides fully behind a move (a0),(a1)+ on the CPU side. But there's no way to achieve that kind of speed and still be perspective-corrected for individual pixels; that is completely out of scope.
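For contrast, this is roughly why the constant-z case is so cheap. When z does not change along a line or column, u and v step linearly and the inner loop needs no divide at all. The sketch below is an assumption-laden illustration in C, not BadMood's actual DSP routine: the function name, the 16.16 fixed-point format, the square power-of-two texture and the `dst_stride` parameter (1 for a row, the screen width for a column) are all hypothetical.

```c
#include <stdint.h>

/* Constant-z (affine) run: u and v are 16.16 fixed point and simply step
   by du, dv per pixel. tex_shift is log2 of the texture width, and the
   texture is assumed square, so one mask wraps both coordinates. */
void draw_run_constant_z(uint8_t *dst, int count, int dst_stride,
                         uint32_t u, uint32_t v, uint32_t du, uint32_t dv,
                         const uint8_t *tex, int tex_shift)
{
    const uint32_t mask = (1u << tex_shift) - 1u;

    for (int i = 0; i < count; i++) {
        uint32_t tu = (u >> 16) & mask;     /* integer part of u, wrapped */
        uint32_t tv = (v >> 16) & mask;     /* integer part of v, wrapped */
        *dst = tex[(tv << tex_shift) | tu];

        dst += dst_stride;                  /* next pixel in row or column */
        u += du;                            /* no divide anywhere in loop  */
        v += dv;
    }
}
```

The whole loop body is a couple of shifts, masks and adds, which is the kind of work that can plausibly fit in a handful of instructions per pixel.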
Anyway, provided other problems don't stop the arithmetic from working once it is translated, I think there is nothing else preventing Quake texture mapping from running as fast as one of the shaders already used in BadMood. Given that there are enough other constraints limiting window size (particularly the span processing cost to build the image), I don't think fillrate is going to be the main issue when attempting this.
That leaves the hidden surface removal and span processing/retrieval problem. It's hard to estimate the cost of that because it's different enough from the other scenarios I tried before, so it could still be a showstopper. But the hard problems are gradually dissolving, so we'll see.

I have some confidence in it working because it's similar to the kind of processing BadMood ended up using on floors, but the amount of processing needed here is still quite a lot higher.
The biggest problem in the end, I think, will be the sum of all these costs just being a bit too much for 16/32MHz, forcing some other decisions to be made. The individual costs are all manageable, but added together they might still be too slow to be usable for a game. Even then it's probably close enough to be usable for something. By 'other decisions' I mean things like flatshading only, using simplified maps, using a small window (160x120) and so on. In other words, limiting the output rather than optimizing further.
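As a rough sanity check on the fill cost alone, the arithmetic can be written out as below. The 29 instructions per pixel and the 160x120 window come from the notes above. The instruction rate passed in is purely an assumed figure for illustration, and both function names are hypothetical. This covers texture fill only and says nothing about hidden surface removal or span processing, which is the part that is still uncertain.

```c
#include <stdint.h>

/* Instructions needed to texture every pixel of the window once
   (no overdraw assumed). */
uint32_t fill_instructions_per_frame(uint32_t width, uint32_t height,
                                     uint32_t instr_per_pixel)
{
    return width * height * instr_per_pixel;
}

/* Upper bound on frame rate if the processor did nothing but fill.
   instr_per_second is an assumption supplied by the caller. */
double fill_only_fps_ceiling(uint32_t instr_per_frame, double instr_per_second)
{
    return instr_per_second / (double)instr_per_frame;
}
```

A 160x120 window at 29 instructions per pixel comes to 556,800 instructions per frame. At an assumed 16 million instructions per second that is a ceiling of roughly 28.7 frames per second for fill alone, so every other cost has to come out of whatever headroom remains below that.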
So I'll plod on over the next week or two as time permits and see if these things can be made to work as planned.