Curious question.
A small disclaimer: over the years, the entire institute course on automatic control theory has evaporated from my head as unneeded, so I won't vouch for the correctness of my reasoning, and I won't give a definitive answer.
To comment on "recording vocals in mono": vocals and guitar are usually recorded in mono (look at external sound cards: they have no stereo input for guitar and vocals, just LINE and MIC), but in the audio editor (Cubase, Audition) one of the steps is always to bring the signal up to stereo, something like a stereo FX. So that's probably not the cause.
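Just to illustrate what that "bring to stereo" step boils down to, here is a minimal Python sketch (the function name and the test tone are my own hypothetical examples, not anything from the question):

```python
import numpy as np

def mono_to_stereo(mono: np.ndarray) -> np.ndarray:
    """Duplicate a mono signal into two identical channels.

    This is roughly what a DAW's "make stereo" step does before any
    actual stereo effects (panning, widening) are applied.
    """
    return np.stack([mono, mono], axis=1)  # shape: (samples, 2)

# Hypothetical usage: a 1 kHz test tone, one second at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
stereo = mono_to_stereo(tone)
print(stereo.shape)  # (44100, 2)
```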
I would put my money on the system's frequency response (magnitude and phase) changing, with the midrange, where the main vocal range sits, being strongly cut. If memory serves, analog and early digital telephone lines pass only about 300 Hz to 3.4 kHz (versus the usual 20 Hz to 20 kHz of headphones), so if you cut the mids with an equalizer, it is precisely the vocals that get quieter.
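If you want to try that EQ experiment yourself, here is a rough sketch that notches out the telephone voice band with a band-stop filter (the 300-3400 Hz edges and the filter order are my assumptions, chosen only for illustration):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def cut_vocal_mids(signal: np.ndarray, sr: int = 44100) -> np.ndarray:
    """Attenuate the ~300 Hz - 3.4 kHz band: the classic telephone
    passband, which is also where most vocal energy sits.
    """
    # 4th-order Butterworth band-stop over the assumed vocal band
    sos = butter(4, [300, 3400], btype='bandstop', fs=sr, output='sos')
    return sosfilt(sos, signal)
```

Run a mixed track through this and the vocals should drop noticeably relative to bass and cymbals, which is the effect described above.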
I find the specific mechanics hard to describe. A small air gap may appear at the plug contacts, acting as an extra resistance/resistor inserted in series with the headphones. The headphones' own impedance is small: unless you have studio headphones and an amplifier, it is only a few tens of ohms, and the amplifier itself (in the sound card) has little power. Other things being equal, with the resistance on the headphone side increased and the power constant, the sound is perceived as quieter, which is what you observe. What I cannot figure out is where the unevenness across frequencies comes from: in theory, a plain resistor should not change the magnitude or phase response at all. I would venture to suggest that the change is actually uniform, the volume dropping equally across the whole frequency range (for the instruments too), and it is only "by ear" that the band where the main register of the human voice sits seems to suffer.
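To back the "uniform drop" guess with numbers, here is a small sketch of the voltage-divider arithmetic (the 32-ohm load and the gap resistances are made-up illustrative values, assuming an ideal voltage-source amplifier):

```python
import math

def attenuation_db(load_ohms: float, series_ohms: float) -> float:
    """Voltage drop across the headphones when an extra contact
    resistance R is inserted in series with the load Z.

    The divider scales the voltage by Z / (Z + R), i.e.
    20 * log10(Z / (Z + R)) in dB. A plain resistor scales every
    frequency by the same factor, so the response stays flat.
    """
    return 20 * math.log10(load_ohms / (load_ohms + series_ohms))

# Illustrative numbers: 32-ohm headphones, various contact resistances
for r in (1, 10, 32, 100):
    print(f"{r:>4} ohm gap -> {attenuation_db(32, r):6.2f} dB")
# A 10-ohm gap already costs about -2.4 dB; 32 ohms costs -6 dB,
# and the figure is the same at every frequency.
```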