This piece is greatly overdue as I meant to get it done in February but managed to hit major writer’s block on everything. Finally it is time.
So during the debate that settled absolutely nothing, during one of Mike’s responses that I still deem an irrelevant deflection (simply: what other papers have or have not done is of no relevance to the discussion of Brad’s paper), I was accused, as I suspected I would be, of having a hardon for Brad’s paper due to it disagreeing with my general recommendations.
First and foremost, this is just a deflection, one I expected, the typical “You’re just emotionally invested in this.” Which goes both ways and isn’t an answer to begin with. It’s the “What’s YOUR agenda?” bullshit that people love to use to put you on trial for asking a question about something else. I could similarly ask him what HIS AGENDA for defending Brad’s shitshow of a paper is. But I didn’t. Not in the debate anyhow. I know the answer anyway. Even more so now than before. But I’ll take the high ground and not make it about that.
It’s also untrue, as my history clearly shows: large-scale changes in the data set have caused me to change my personal model. Take fiber type conversion, for example, where I realized that what I was taught, and what I had thought for over a decade, is no longer supported by the research. I’ve done the same with endless aspects of diet. The whole accusation is baseless on top of being a non-answer to any of my questions.
I like new findings, I like being wrong (contrary to belief) because when I am it usually forces me to examine my current ideas, look at the literature in more depth or (as was the case with statistics), go learn something I should know in more detail. And I will admit it when I am. Something that the guru crew and insecure babies in this industry can’t ever do. It truly is amazing how none of them seem to have EVER been wrong about anything. They are truly gods among men.
My initial issue with Brad’s paper has to do with an issue in the discussion (lying about the Ostrowski data to reverse the conclusion to agree with his) as that’s what put it in my target circle. Everything else came afterwards, along with the simple fact that the statistics still IN NO WAY supported the strongly worded conclusion no matter what word play or semantic games are used to make weak evidence or non-evidence do so. Read this. This isn’t arguable even if people continue to try.
At best two muscles showed weak/anecdotal/not-worth-a-mention support for the strongly worded conclusion. That’s not support of anything. It’s garbage. If a paper with those weak sauce stats had come out that Brad or James disagreed with, they’d have jumped on it in a heartbeat as being incorrect and torn it to shreds. I know it, he knows it and anybody who knows him knows it.
But remember, stats only matter when it’s not YOUR paper just like proper methodology only matters when it’s not you being sloppy as hell. Brad expects others to do methodologically correct work and adhere to their statistics when making conclusions when he can’t/won’t do it himself. Just like Layne was the anti-guru until he was selling metabolic damage and reverse dieting seminars. Then every strategy he used to criticize became fair game (“I have hundreds of emails” which is not science). I digress.
Simply, Brad’s study did NOT contradict my recommendations because the stats didn’t support the conclusion to begin with and a non-result doesn’t contradict me or anybody else. If you take the moderate group as best (i.e. better than the lower volume group) which the stats actually supported, it’s mostly consistent with the other studies volume wise (the leg volume is much higher at 27 sets but only compared to 9 which tells us nothing about what happens in the middle).
If you don’t count the sets moronically it’s even closer because all the numbers get cut in half to begin with. Put differently: if you take the study’s ACTUAL results (instead of the bullshit claimed results) at face value and don’t count sets like a dipshit, they really don’t contradict the other literature. But that’s never what it was about to begin with.
But making it about me and saying “You only dislike this because it doesn’t support you” is what I expected to occur. Question my motives, my actions, my motivations to avoid a direct answer to the question. Great for politics, which is Mike’s true future if he’ll go that route. But it’s total bullshit in what is meant to be a proper debate where direct questions should get directly answered rather than deflected with smokescreens.
The Barbalho et. al. Paper: Introduction
In that same vein, Mike asked me why I had not done a similar analysis of the Barbalho paper that came out last year on women, which found that volumes significantly lower than the 10-20 sets I suggested in my research analysis were sufficient. He figured it too must contradict my general recommendations of 10-20 sets, so why hadn’t I taken it apart?
As noted, partly because Brad’s results actually don’t contradict anything so it’s not comparable. Well, not if you look at the statistics correctly and don’t count the sets like a dipshit. The main reason was honestly time; I’ve had notes on that paper on my desktop since it came out last year and I simply got occupied with other things. The prepublication paper also had images I couldn’t view at full size to show or even examine the data properly. So it was on the backburner.
And Mike’s bringing it up was another in a long series of smokescreens and deflections but that’s all he brought to this gunfight. Oh in case you missed it, both Brad and Mike said that there was a word limit on Brad’s paper to excuse how he presented the Ostrowski data in such limited terms.
This is a boldfaced lie.
MSSE has no word limit on manuscripts (it has one on titles and abstract ONLY as you can see if you click the link). Just for the record. It’s one of several lies Brad and James have thrown out during this thing that nobody wants to address. I’ll be collecting them all shortly because Brad, Mike and James have made a number of out and out lies throughout all of this.
In any case, I don’t consider the Barbalho paper on women to be part of what was being discussed with regard to Brad’s paper because women are not men and you can’t automatically consider them within the same dataset. I wouldn’t consider a study on elderly men any more relevant to the discussion that was at hand. As well, this is still a deflection: “If you’re analyzing this study so hard, why not this other study?”
It’s akin to “If one man is on trial for drunk driving, why should I not be looking to see if other men are driving drunk?” Because those other men are not on trial. Barbalho is not on trial. I am not on trial. My website is not on fucking trial. Brad’s study was on trial and nothing more. But smokescreens will always be easier (and fool the gullible better) than direct answers. Mike, you and politics. Seriously. I wouldn’t vote for you but you’d be good at it.
However, that doesn’t mean that I shouldn’t analyze the Barbalho study on fundamental grounds and I wanted to delve into it to begin with since it came out. The results are a little confusing given other (admittedly indirect) data and I did want to examine it in detail anyhow to see what was going on. And now I’m going to.
And to make sure nobody throws a tizzy, I will make sure to examine it with as fine a toothed comb as I did for Brad’s paper; in fact, I’ll go even deeper into this so that none of the guru circle jerk can even attempt to claim I ignored something. Having just looked at it again, Mike has actually done me quite the service since many things done in Barbalho are, well, no spoilers.
This research review will be a little different than the others. First it’s a lot longer since I will be looking at issues that I have typically ignored or skated on. I’m not ignoring anything on this one so Mike can’t whinge (that is the correct word, look it up) about it or make some minor nitpick to dismiss the whole thing (another common guru tactic).
And in that vein, given what stimulated this little exercise and why, I will be including what I call Mike notes in the discussion. Callbacks to what he said in ‘defense’ of Brad’s article and how it relates to this one to see if they actually have any merit. I’ll put them in quotes as needed. I’m also going to put this paper’s methodology up against Brad’s at the very end to make the point. Let’s do this.
Be forewarned, this is super tediously long and super tediously ranty. You’ve been warned.
Barbalho M et al. Evidence for an Upper Threshold for Resistance Training Volume in Trained Women. Med Sci Sports Exerc. 2019 Mar;51(3):515-522.
Introduction: The purpose of the present study was to compare the effects of different volumes of resistance training (RT) on muscle performance and hypertrophy in trained women. Methods: The study included 40 volunteers that performed RT for 24 weeks divided in to groups that performed five (G5), 10 (G10), 15 (G15) and 20 (G20) sets per muscle group per session. Ten repetition maximum (10RM) tests were performed for the bench press, lat pull down, 45º leg press, and stiff legged deadlift. Muscle thickness (MT) was measured using ultrasound at biceps brachii, triceps brachii, pectoralis major, quadriceps femoris, and gluteus maximus. Results: All groups significantly increased all MT measures and 10RM tests after 24 weeks of RT (p<0.05). Between group comparisons revealed no differences in any 10RM test between G5 and G10 (p>0.05). G5 and G10 showed significantly greater 10RM increases than G15 for lat pulldown, leg press and stiff legged deadlift. 10RM changes for G20 were lower than all other groups for all exercises (p<0.05). G5 and G10 showed significantly greater MT increases than G15 and G20 in all sites (p<0.05). MT increased more in G15 than G20 in all sites (p<0.05). G5 increases were higher than G10 for pectoralis major MT, while G10 showed higher increases in quadriceps MT than G5 (p<0.05). Conclusions: Five to 10 sets per week might be sufficient for attaining gains in muscle size and strength in trained women during a 24-week RT program. There appears no further benefit by performing higher exercise volumes. Since lack of time is a commonly cited barrier to exercise adoption, our data supports RT programs that are less time consuming, which might increase participation and adherence.
Note: This was published in the SAME journal Brad’s volume paper was so you can forget trying the kind of appeal to authority that was made in MASS about MSSE being ‘a top tier journal’ as if that means they can’t publish crap.
So the background on this has to do with the dose-response relationship to hypertrophy and strength gains. Or rather, what volumes give the optimal or best growth/strength gains. There are various ways to define optimal here. So say we have a situation where 3 sets gives 2.5% growth, 6 sets gives 4% growth and 12 sets gives 5% growth (these are made up numbers). Doubling your sets from 3 to 6 only gets you 1.5% more growth.
Doubling it again gets you 1% more over that, and only double what the original 3 sets gave in total. So 4 times the volume gives you only double the benefit. Mind you, this assumes that more work actually does give you more gains, which is only sometimes true. Don’t focus on the specifics but the logic: in this example, quadrupling your total work gives you double the results. So we ask the question:
Which is best/optimal: 3 sets, 6 sets or 12 sets?
The answer is context dependent. For the psychotic athlete at the top of their sport looking for every percentage point possible (or every percentage point in a fixed unit of time), doing the maximum may be best (assuming they don’t get hurt of course). They have the time and drive to do it and that extra 2.5% may be the difference in winning and losing at the highest levels (where first and 10th may be decided by 1% differences).
I mean, you can finish a marathon running 4 days per week with one long run. But having a chance of winning takes 100-120 mile weeks and running one or more times daily. Which you choose depends on your goal (and realistic chance of winning). Elite runners with a chance to win put in every mile they can survive. Recreational runners don’t. Or shouldn’t. You can get pretty far in Olympic lifting training only 3-4 days per week but making the podium probably requires triple that (and lots of drugs). Elite athletes do as much as they can survive (and many do not). The rest of humanity needn’t and shouldn’t.
If you’re a general trainee and have limited time, the lower volume of training may give you the best return on your time as well as all the gains you want. No matter how you cut it, it certainly gives you the best return on investment no matter what since quadrupling your investment to double your gains is a poor ROI.
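To make the ROI point concrete, here’s a toy sketch using the made-up dose-response numbers from the example above (purely illustrative, not real data):

```python
# Hypothetical dose-response numbers from the example above
# (made up, purely illustrative -- not real data).
growth = {3: 2.5, 6: 4.0, 12: 5.0}  # weekly sets -> % muscle growth

for sets, pct in growth.items():
    print(f"{sets:>2} sets: {pct}% growth ({pct / sets:.2f}% per set)")

# Quadrupling the work of the 3-set condition buys only double the growth:
print(growth[12] / growth[3])  # 2.0
```

The return per set falls as volume rises, which is the entire argument: each additional set buys less than the one before it.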
Excessive time requirements for training are frequently offered as a huge barrier to exercise and getting the best results for the least time may be the best choice (consider everybody’s hardon for HIIT). By training for the best ROI you can still make gains while having a life.
Even if they are slightly less over the short-term, it may not matter. Eventually you hit your genetic limit so does it really matter in the big picture whether you get there a little bit faster or slower? Whether it’s 3 years or 4 years, does it really matter in 99% of situations? No, it does not. And no, Mike, doing it this way doesn’t somehow make it 7 years to get near your limits versus 3. It’s a small difference and if you have an issue with that take it up with Eric Helms who gives similar numbers to mine and with whom you have far more disagreement.
Now a lot of work has been done on men. Most of it is in beginners, which I don’t care about (a lot of the beginner work is also done on women but uses low volumes). A lot of it is in older folks, which I also don’t care about (ok, that’s not true, I do care about it, just not within the context of this particular research review or topic). Usually low volumes work as well or maybe a little bit less well than slightly higher volumes but they are usually comparing like 1 to 3 sets per exercise or 2 to 3 days of training or some combination of the two so it might be 2 vs 6 sets per week or 3 vs 9 total sets or something.
And you usually see that 1 set per exercise done twice a week gives most of the results of 3 sets twice per week or adding a third day of training. From memory, it’s usually in the range of 80% of the results with the lower volume/frequency. For the general population who just needs basic strength fitness and health, there’s no point in tripling the length of the workout or adding a third day of training for only a small return. Especially since you frequently see more dropouts. You want the most efficient training per unit time. And low volumes win.
But so far as this topic, it’s more about an examination of the dose response curve of training volume to muscle growth. I examined this for the extant studies on men here but this is the first paper I’m aware of to both examine it in trained women and use relevant training volumes (i.e. not comparing two very low volumes of training like 3 to 6 sets per week or whatever).
Women and Training Response
I’d note as a prequel to this that there is good reason, both scientific and empirical, to believe that women can both handle and might benefit from (or even require) more volume than men. Some of this is somewhat indirect, looking at fatigue and recovery for women and men after fixed exercise bouts, or lactic acid or fatigue during aerobic or interval work. In many cases at least, women generate less fatigue during a workout and women recover more quickly both within a training session and between training sessions than men.
This isn’t universal and depends on the type of workout but I’m saving that for Volume 2 of The Women’s Book. But in the aggregate, this would suggest that, assuming you accept some degree of fatigue as required for a training stimulus, women would need to do more work to get the same fatigue as a man and generate the same gains.
That is certainly the framework I and most others in the field work from past a certain point in a woman’s training at least (very little of this stuff applies to relative beginners): that more advanced women may need slightly more volume than men for optimal results (Mike and I agreed that as a rough guideline they might need ~30% more sets than men which might sound like a lot but is only a few sets more so don’t go nuts. A man who did 10 sets in a workout means the woman is doing maybe 13). Mind you, others have argued the opposite, that women should need/can handle less work due to lower levels of testosterone. Both arguments probably have some merit depending. Take your pick.
But very little work has studied it systematically in the weight room, especially over long periods of training. It’s usually just acute workouts. In fact, this paper cites only one paper on ‘recreational weightlifters’ done with both women and men. It compared 3 sets to 9 sets per week of exercise with both groups increasing strength equally. It used skinfolds to track body comp which is pretty poor and I’d draw zero conclusions here about body composition changes. It was in folks with only 1 year experience and concluded that gains could still be made with 1 set per exercise to failure done three times per week. Outside of that, there appears to be no other work comparing volumes in trained women at least as of the time that Barbalho was prepublished or the time I got to writing this thing.
So this study is extremely novel for that reason alone: it’s really the first systematic study to examine the topic using any sort of different volumes. That, of course, also means that it only represents a single finding since the rest of the work is in men. So as Mike repeatedly pointed out, I will try not to overvalue it. I don’t think it belongs with the men’s data for what should be obvious reasons so it’s really only a singular data point and will need replication or contradiction before we can draw any meaningful conclusions.
I will, however, take its results at face value to be sure. But like they harped on, one study is one study. Except that in this case, it’s not one more study added to a previous body of literature like Brad’s paper was (his was one of EIGHT). It’s the ONLY study yet done so any conclusions drawn must be extremely tentative. Science rests on replication and one study means jack.
As stated in the introduction of the paper:
No prior studies have considered trained women, and many have not considered set volumes much higher than 10 per muscle group per week. Considering the controversy around the topic and the importance of defining an adequate dose-response for muscle hypertrophy and performance in women, the aim of the present study was to compare the effects of different volumes of RT in these outcomes in trained women. Our hypothesis was that different training volumes will result in similar increases in muscle size and strength.
That last sentence is interesting in that their starting assumption is that there will NOT be a difference between volumes which is the opposite of what you might expect for someone looking at volume of training and the strength or size response. Don’t get me wrong, the hypothesis and null can go in either direction and they could have hypothesized that the higher volumes would in fact generate larger growth or strength responses with the null being that they wouldn’t.
But it is interesting to consider that they might be going into the study with a preconceived notion about the response to training volume. Factually, this research group seems to have somewhat of a low-volume/HIT bias, with most of their studies supporting lower volumes, which might explain it. I’m simply mentioning that as an observation, not a dismissal, and ultimately the study stands on its design, methodology and results or it doesn’t. Irrespective of that, I will now bore you to tears with the most involved research review I’ve ever done. Mike wants a fine-toothed comb for consistency, well I got my fine toothiest. Get a snack.
So the study intended to examine the dose-response relationship to varying training volumes in women, as described above and I’ll outline the specific methodology below.
To determine how many subjects would be required, the study did what is called an a priori sample size analysis with the goal of detecting a 0.6 effect size (ES) with a power of 0.8. What this essentially means (and I had to farm it out to a stats buddy to avoid making any dumb mistakes) is that they consider anything less than a 0.6 ES to not be practically relevant (i.e. too small to be real-world significant). This is actually an important thing to consider.
Often in research, a statistically significant difference may be real-world irrelevant in absolute terms. And a non-statistically significant result may be real-world relevant (except for the fact that you can’t even state that the result was ‘real’ if it didn’t reach significance to begin with). Relative ESs are kind of useless on their own since we don’t know what they translate to in the real world, but no matter (if a large ES translates to a small real-world result it kind of doesn’t matter).
Yeah, a 0.6 ES is bigger than 0.2 (or whatever) but if the real world difference is irrelevant who cares (and note that basically all of the reviews that Brad and James and many groups churn out use relative ES’s which tell you exactly jack shit about the real world impact in absolute magnitude). They decided that even if growth were “better” for a given group, if the ES was less than 0.6 it didn’t matter practically.
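To put that relative-versus-absolute point in concrete terms, here’s a toy sketch (all numbers invented for illustration) of why an ES by itself says nothing about real-world magnitude:

```python
# Cohen's d is scale-free: the same d can come from wildly different
# absolute differences. All numbers here are invented for illustration.
def cohens_d(mean_diff, pooled_sd):
    return mean_diff / pooled_sd

# A 0.6 mm thickness difference against a 1.0 mm SD...
print(cohens_d(0.6, 1.0))  # 0.6
# ...and a 3.0 mm difference against a 5.0 mm SD give the SAME d:
print(cohens_d(3.0, 5.0))  # 0.6
```

Same effect size, five-fold difference in the absolute result. Without the absolute numbers, the ES alone can’t tell you whether the difference matters in the real world.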
Anyhow, the power of 0.8 means that they want an 80% chance of correctly detecting such a change. This leaves them open to a 20% chance of missing an effect that is real (i.e. a false negative). By this they determined that 35 participants would be necessary. To ensure that they had enough, they kept recruitment for the study open until they had 40 subjects since there are usually dropouts. They wanted to ensure the paper had enough subjects to draw good conclusions.
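For the curious, here’s roughly how ES and power feed into a required sample size. This is the textbook two-group normal-approximation formula, NOT necessarily the calculation the authors actually ran (a four-group ANOVA-style calculation, or treating the 0.6 as an f rather than a d, comes out very differently, which is presumably how they landed at 35 total), just a sketch of the machinery:

```python
from math import ceil
from statistics import NormalDist

# Textbook per-group n for detecting effect size d with a two-sided
# test at significance alpha and power 1 - beta. A sanity-check sketch
# only; the authors' actual (likely ANOVA-based) calculation differs.
def n_per_group(d, alpha=0.05, power=0.8):
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = z(power)           # ~0.84 for power = 0.8
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.6))  # 44 per group for a simple two-group design
```

Note how demanding more power or a smaller detectable effect drives the required n up fast, which is exactly why underpowered studies are so common in this field.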
Mike notes: In the debate Mike brought up the very real problem with having sufficient study subjects. Even in Brad’s paper, the original goal number of subjects fell short since some folks dropped out which gave it less statistical power than it could have had. This is a problem in science all over the place, especially in exercise science where it’s usually college students who can give up 8 weeks to get credit for the study and you have to run the study when you have them.
And yet this study managed to find a way around it. They determined how many people they needed and waited until they had more than that to make sure their conclusions would be more valid. It can be done. So saying that “This is endemic in sport science so we should give a pass to weak papers” is a lame deflection. That others papers are poor doesn’t make it ok for any given paper to also be poor. Especially when the problem can apparently be solved if you actually bother to try.
All volunteers had to be at least 18 years old with no clinical conditions that could be aggravated by the study (this tends to be more of an issue for women than men), and this was confirmed by a physician. They had to have been training continuously for the last 3 years a minimum of three times per week, so these weren’t untrained noobs who basically get the same results no matter what you do. Most were used to training each muscle group once or twice per week with 18 sets for upper body and 24 sets for lower body, although no more specific information was provided about their training except for a one-off in the discussion I’ll come back to.
Previous training style is being mentioned more and more in studies as some of the apparent ‘changes’ in size or strength may be due to training in a novel way rather than due to the specific intervention. They established a minimum attendance of 80% to be eligible for data analysis (so anyone who attended fewer sessions than that didn’t have their data considered). At first glance, these appear to be fairly well-trained women (3 years experience) but as I’ll discuss at the very end, I am not sure this is entirely true and that alone may be impacting the results.
Diet was not controlled but it literally never is in these studies (and self-reporting tends to be, well let’s just call it weak). It costs way too much money but acts as a HUGE confound. Subjects often lose fat in these studies which alone would tend to limit muscle gains. On that note, many if not most studies will report initial and post-study anthropometrics. Height, weight, body composition if it’s measured etc. And this is important data to show.
If they measure body composition it’s normal to report those values and changes. This can be super useful when the diet issue comes up. If someone loses a bunch of fat in a study, they were likely undereating and that throws a wrench in things. They might not have gained any muscle due to eating too little rather than anything else. Even starting and ending weights can be marginally useful. If someone loses weight, they weren’t eating enough which skews the results. If someone gains, at least calorie intake was probably sufficient although you can’t de facto conclude anything about actual body composition changes.
In Brad’s study, only the pooled anthropometrics were provided, without a per-group breakdown. Presumably there was no difference but it’s still atypical; most studies provide the numbers for each of the study groups (and assuming proper randomization was done, they should be more or less identical). And no post-study data was provided. Not even weight, which is simply bizarre. Couldn’t Brad have thrown the subject on a scale when he was doing the unblinded Ultrasound? Apparently not.
However, in the study I’m looking at today, while per-group anthropometrics were shown for the start of the study, post-study data was not provided. This is a major oversight as even reweighing the subjects would seem fairly trivial (presumably height would not change). Knowing how and if weight changed for the subjects over the length of the study would have been useful. If one group lost weight and that group didn’t gain much muscle, it could have been a diet issue. It is a severe oversight that I would be remiss to ignore and a seemingly easily corrected one. This is a big negative point for this study, one that I do not understand since it seems so obvious to do and is trivial time-wise.
The subjects underwent two testing batteries, once before the study and once after it was done a whopping 6 months/24 weeks later. This was a long study which is extremely atypical and I’ll come back to this later in this writeup since I think it’s important in the overall results.
For strength testing, the subjects tested their 10RM on the bench press, lat pulldown, 45 degree leg press, and SLDL over a span of 3 days and a very standard protocol to determine 10RM was used. They state that:
The 10RM was chosen over the 1RM because when participants are training at high repetition ranges, it seems more appropriate to evaluate performance through multiple repetition tests (33).
Which makes great logical sense and makes me wonder why so many studies test 1RM when their workouts are based around high repetitions. One of the defenses over the total lack of progressive strength gains (i.e. more volume = more strength) in Brad’s study (a finding that more or less contradicts the entire body of work on the topic) is that they were working at 8-12RM and that shouldn’t increase 1RM. Perhaps although repeat sets of 8RM sure as hell should. Or perhaps the lack of strength gains was due to there being no real difference in muscle gains (which is at least as ecologically sound an explanation given the fact that increased myofibrillar growth should increase strength).
Ok, now here’s where it gets fun for me:
Muscle thickness was measured by Ultrasound, the same Ultrasound that is used in a majority of these studies. Subjects were tested at week 1 and week 24 for biceps, triceps, pecs, quadriceps and gluteus. This raises again in my mind why so many studies insist on using nothing but biceps, triceps and quads (vastus medialis and/or rectus femoris). Clearly you can do pecs, you can do glutes, I suspect you could probably do back if you put your mind to it and I’m pretty sure I’ve seen delts measured somewhere. There are entire societies dedicated to anthropometric measures; this stuff can surely be done somehow.
So why aren’t they being done in other studies so that reasonable conclusions can be drawn about set counts and growth and optimal volumes? That is, why do so many studies use nothing but compound pec work and then not measure pec growth? I don’t know the answer to be honest. Is it technically more difficult? If so, what good is being a ‘trained’ Ultrasound tech? Is it a time thing? Or just abject laziness and more sloppy methodology? I truly don’t know.
Another Rant about Set Counting
Because if you give someone 30 sets of compound chest work and then measure triceps, I don’t think you can make any sort of meaningful conclusion about set counts for optimal growth for pecs OR triceps. If you want to know how set count correlates with growth with chest work, measure the fucking pecs directly. Alternatively, count sets intelligently and say that your 30 sets is about 15 for triceps so that you don’t try to oversell bullshit online. Or do direct triceps work and measure the triceps. 30 sets of compound chest is not 30 sets of effective triceps work and everybody knows it except those trying to sell crap and “results that change everything!!!!”.
I am aware that Brad is NOW saying “Higher volumes potentially increase growth” which is a cute qualification months later (and still not supported by his own goddamn statistics). This didn’t prevent him or James from selling the “Our volume numbers will BLOW EVERYTHING OUT OF THE WATER” line or their paper conclusion, so I don’t give a fuck what they are saying now. Their initial conclusion and how they presented it publicly was made in strong terms no matter what semantic hoops Mike wants to jump through.
That they finally pulled back after they realized they couldn’t keep selling bullshit is irrelevant to me unless they flat out said “We were wrong to present our data that way initially” And we all know that the guru can NEVER admit to being wrong and that with the greatest of all likelihood neither of them did (remember, I can’t follow their nonsense directly since both pansies blocked me for being mean so I’m usually relying on second hand reports).
Because even IF 30 sets for chest gives the best response in the triceps, that doesn’t mean that 30 sets per week is best for growth overall without direct pec data. Certainly not for pecs, which were not measured. And it means, roughly, that 15 sets might be optimal for triceps. And as I mentioned before, even if you take Brad’s results as true (i.e. that high volume was best, which the statistics still don’t support) and remath the set count rationally, they come back down to much less moronic levels very much in line with the broader body of literature.
The high volume of 30 sets for chest becomes 15 for triceps which, voila, is between 10 and 20. The highest leg volume of 45 sets becomes the mid-20’s (due to the leg extension) but with no real middle ground tested (and legs possibly needing more volume). The 1:1 set counting thing is just dumb as fuck although somehow the paper I’m examining now found a way to make set counting even stupider.
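The remath above is just fractional set counting. As a toy sketch (the 0.5 weighting for indirect compound work mirrors the halving used above; it’s a rule of thumb, not an established constant):

```python
# Toy fractional set counting: indirect (compound) sets count at a
# reduced weight toward the secondary muscle. The 0.5 weight mirrors
# the halving used in the text; it's a rule of thumb, not a measured
# constant.
def effective_sets(direct_sets, indirect_sets, indirect_weight=0.5):
    return direct_sets + indirect_weight * indirect_sets

# 30 weekly compound chest sets, zero direct triceps work:
print(effective_sets(0, 30))  # 15.0 effective triceps sets
```

Count that way and the “shocking” 30-set number for triceps lands right back inside the 10-20 range.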
Back to Barbalho et. al
All subjects were tested at 7-8 am after a standardized breakfast and 3-5 days after the last training session to minimize swelling (and again they could have done bodyweight easily here). Now this is interesting. One of the more contentious points about Brad’s study was the time point of 48-72 hours (2-3 days) to “reduce edema” since there is quite a bit of research (examined by Lucas Tufur) suggesting it’s very much not gone at that time point.
The study everybody cites to support 2-3 days as sufficient used 9 sets of bench press per week and there is no indication one way or the other that those results hold for high volumes, especially not 30-45 sets/week. They might or they might not. But you can’t automatically assume that it does hold or use it as evidence for a paper using 3-5 times that volume. This is a pilot study that needs to be done: figure out the time course for edema for different volumes of training. Have folks do varying volumes and measure thickness daily for 7-10 days to see where rapid drops stop as that will tell you when edema is gone. Someone needs to answer this question.
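Purely as a sketch of what that pilot analysis could look like (everything here is hypothetical: the thickness values, the 7-day window and the 0.1 mm flatness threshold are all made up for illustration): track thickness daily after the final workout and flag the first day the day-over-day drop flattens out, which would be your “edema gone” time point.

```python
def edema_resolution_day(daily_thickness_mm, flat_threshold=0.1):
    """Return the first day (1-indexed) where the day-over-day drop in
    ultrasound muscle thickness falls below flat_threshold mm, i.e. the
    rapid fluid-shift decline has plateaued. Purely illustrative."""
    for day in range(1, len(daily_thickness_mm)):
        drop = daily_thickness_mm[day - 1] - daily_thickness_mm[day]
        if drop < flat_threshold:
            return day
    return None  # never plateaued within the measurement window

# Hypothetical post-workout series: big early drops, then stable
series = [34.0, 33.2, 32.6, 32.2, 32.15, 32.1, 32.1]
print(edema_resolution_day(series))  # hypothetical data: plateaus on day 4
```

Run that per subject per volume group and you’d have an empirical time course for edema at each training volume instead of everyone guessing.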
Couple that with the issue of extracellular water increasing and you have a situation where fluid shifts could be throwing off the Ultrasound if you measure too soon after the final workout when looking at very high volumes. The Haun study Mike himself was involved with only found significant water retention past 20 sets/week, a threshold this study did not cross but that three of the groups in Brad’s study (30 sets triceps, 27 sets legs, 45 sets legs) DID CROSS. The ones that Brad and James, despite zero/worthless statistical support, CONTINUE to claim grew the best. Hmm…
There is also that odd little study I cited in my writeup of the newer Haun et al. analysis on sarcoplasmic hypertrophy which found that all the growth after 3 months of training was gone after a week. So it was just fluid shifts. Suggesting that 3 days may not be long enough (of course even the 5 days in Barbalho would also be insufficient and that’s why we need this data). Now, if data comes out showing that it factually is gone at the 3 day mark for high volumes, I’ll admit I was wrong. If that same data comes out showing that it isn’t, do you think Brad and James will admit they were? Because for weeks, James would cite a paper, then dismiss it and switch goalposts. Because gurus are NEVER wrong.
In any case, Barbalho et al. picked 72-120 hours (3-5 days, with 3 admittedly being on the low end) instead which I suspect will give a much more accurate indication of real muscle growth versus just rapid fluid shifts which can in fact occur during and right after training (and a week might be better). This is especially true given the length of the study, a point I’ll come back to below. I’d consider that an enormous strength compared to most research, which is likely just measuring too soon.
Since nobody is losing real muscle within 3-5 days of ending training, waiting longer to take the measurement should be a lot more valid in this regard. Once fluid shifts dissipate (and we still need to know exactly when this happens), the changes you’re measuring are actual muscle. And another point in favor of Barbalho over Brad’s study. It won’t be the last.
Irrespective of that, check out this next sentence (MT is muscle thickness):
All MT measures were performed in a specialized clinical center by the same experienced technician, that was not involved in the study and who was blind to group allocation.
Basically they got someone not involved in the study who didn’t know who was in which group to do the measurements. HOLY SHIT, IT CAN BE DONE and THAT is how you reduce bias (as per Cochrane guidelines which, Mike’s warbling to the contrary, absolutely SHOULD fucking apply to exercise science).
Mike Notes: So one of my major beefs with Brad’s paper was that he did the Ultrasound himself and he wasn’t blinded to the subjects. He knew who was in which group. Mike’s general defense was that a lot of studies aren’t blinded and a lot of science is sloppy so it’s ok, the same non defense of “Since he did it, it’s ok if I do it too”. Which is horseshit to begin with but let’s go with it. Because here we have a paper that he challenged me to analyze that DID in fact blind the Ultrasound tech. Perhaps he should have read it in full before laying down the gauntlet since his challenge just bit him in the ass.
Blinding the tech can be done and factually IS done by labs that care about good methodology (the lab that Mike was involved with on his volume study blinds the Ultrasound more often than not). Brad is apparently just too sloppy or lazy to do it after years of doing research AND supposedly teaching a research methods class and somehow that gets a lame “Others do it so it’s ok” defense. Blinding the tech can (and should) be done whenever possible. If Brad can’t figure out a way to do it perhaps he needs a refresher class on good methodology or should collaborate with a lab that knows how to do things correctly. Or change careers to something less methodologically demanding. Starbucks is always hiring.
So far I’d say this paper is head and shoulders above Brad’s methodologically. They did an a priori power calculation (Brad did this too, so spare me the strawman, James) and actually got enough study subjects to meet it (Brad’s paper ended up with fewer subjects than needed due to dropouts), ran the study for 6 months rather than 8 weeks, chose a longer time period after the final workout to measure muscle thickness and blinded the Ultrasound tech who had no clue about the study itself, much less who was in which group. It’s truly almost as if you can do methodologically sound science if you care enough to bother.
Women and the Menstrual Cycle
On that note however, while the study was 24 weeks/6 months long, it does not appear that the phase of the menstrual cycle was controlled for in any of the testing or measurements (it is not mentioned anywhere in the study). Now you might argue that since it’s 6 months, the women should fall in roughly the same phase of the cycle at each measurement so this doesn’t matter. Except that would be wrong, since the idea that the menstrual cycle is exactly 28 days long is incorrect.
Cycle length actually ranges from 24-32 days between women and not all women are even consistent month to month. A woman with a slightly shorter or longer cycle would be precessing (look it up) through the phases of her cycle relative to the study calendar. A 2-3 day difference per cycle over 6 months adds up to a 12-18 day offset, which puts a woman in a totally different part of her cycle.
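To put rough numbers on that drift (a back-of-the-envelope sketch; it assumes every woman starts the study on cycle day 0 and the specific cycle lengths are hypothetical examples within the 24-32 day range):

```python
# How far out of phase do women with different cycle lengths end up
# by the final measurement of a 24-week study?
study_days = 24 * 7  # 168 days

def phase_on_final_day(cycle_length_days):
    # Day-of-cycle (0-indexed) on the last day of the study,
    # assuming the subject starts the study on cycle day 0.
    return study_days % cycle_length_days

for cycle in (26, 28, 30):
    print(cycle, phase_on_final_day(cycle))
# A 28-day cycle lands right back on day 0, a 26-day cycle lands on
# day 12 (roughly peri-ovulatory) and a 30-day cycle on day 18 (luteal).
```

So three women who all started in the same phase finish the study in three completely different phases, which is exactly the problem with hand-waving this away over 6 months.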
Now on the one hand this is inexcusable for any study on women in this day and age. A lot of early studies (I mean in the 1970’s) drew totally incorrect conclusions by not paying attention to it at all. At the time, it simply hadn’t occurred to anybody. Now we know that it’s critical to getting good results. On the other hand to do it correctly is really a bear.
Earlier studies relied on women counting days from menstruation (i.e. ovulation occurs on average 14 days afterwards) but this is terribly inaccurate. More modern studies use near daily blood hormone analysis, vaginal Ultrasound or both which is expensive and technically demanding. Let me note again, blinding an Ultrasound tech is free so it’s not a valid comparison to make between the two issues as to why one might be ok but the other is not.
Don’t misinterpret me here, I’m not saying this was ok or that the difficulty excuses it, just that it is not comparable to blinding the fucking tech (again, free to do). But it is a major limitation to the study. Because we simply don’t know in either direction, if the changes known to occur in water balance throughout the cycle do or do not impact on the measurements of muscle thickness. It might, it might not. We simply don’t know. And it’s important to know this, or control for phase of the cycle, going forwards to make sure that any measurements are not being impacted in EITHER direction, up or down.
There’s another pilot study to get done before more of this research is performed: find out how changes in body water balance throughout the menstrual cycle impacts on Ultrasound measurements so it can be taken into account when and if it cannot be controlled for. I know that I “Don’t do science” but this seems trivial as hell to do in a methodological sense.
Take a group of women and do Ultrasound (with a blinded tech) during each week of the menstrual cycle to see what changes (I’m not even talking in relation to training, just baseline changes) occur in muscle thickness. Throw in a measurement of water balance and you know what is a real change and what is fluid shifts. Boom, now we can say that there is an increase or decrease of so many mm in muscle thickness at different parts of the cycle so that you can make a correction for any study that doesn’t measure menstrual cycle phase directly. We need this data before any more of this is done in women. Or the menstrual cycle phase must be controlled to account for that potential confound.
The Training Program
Ok, so the training program is weird and, honestly, kind of stupid. The women were put on a three day per week split routine (push/pull/legs) with each muscle supposedly being hit once per week and, as above, the study was 24 weeks which is quite long relatively speaking. The Haun et al. study Mike was involved with was 6 weeks, Brad’s was 8 weeks, Ostrowski was 10 weeks and the two GVT studies were 6 and 12 weeks respectively.
6 months is a lifetime for a study like this and I’ll come back to this below since it’s super important. Only the Raedelli trash fire paper was the same length but it was in beginners, the results don’t make an iota of fucking sense and, meh. In no world do beginners not get growth until they do 27-45 sets/week or whatever the insane results were. Anyhow.
Ostensibly, the goal of the study was to compare volumes of 5, 10, 15 and 20 sets per week and this is how most people (who seem to have stopped at the abstract) reported it. But this isn’t actually what they did in terms of the muscles they measured. Shown below is how the workouts were set up along with the set counts. And it’s dumb.
Ok, so look under G5, which they call 5 sets per week. For Monday they performed this as 2 sets of bench press, 2 sets of incline barbell bench and 1 set of military press. So it’s 5 sets for the workout but only 4 of those hit chest with 1 for shoulders (in G10 it’s 4 for bench, 4 for incline and 2 for military; in G15 it’s 5, 5 and 5; and in G20 it’s 7, 7 and 6).
Which means that, despite being called 5 sets/week, it is at best 4 sets for the chest which, recall, was measured for growth. Based on my consistent assumption that a compound movement is worth half as many sets for the smaller supporting muscles, I’d consider that as 2 sets for the triceps (which was also directly measured). They didn’t measure delts so the set count there doesn’t matter but the military press does add another 0.5 sets to triceps. So the G5/5 ‘set’ workout on Monday ends up being 4 sets for chest and 2.5 sets for triceps by my counting. It’s not 5 sets for any muscle group no matter how you count it.
For G10 it’s 8 sets for chest which yields 4 for triceps plus 1 more from the shoulder press for 5 sets of triceps total. G15 is 10 sets for chest which is 5 sets for triceps plus 2.5 from military press for 7.5 sets for tris, and you get the idea. Considering any of these workouts to be 5, 10, 15 or 20 sets is asinine but tells you who never read past the abstract because that’s how most reported on the study.
Thursday is similar. Back wasn’t measured so that kind of doesn’t matter; I’ll count every compound pulling exercise as half a set for biceps. So thankfully it’s the same 2.5 sets for biceps (and even here we might quibble about the biceps involvement in the upright row but I want to retain consistency across my analyses). This goes to 5 on G10, 7.5 on G15 and 10 on G20. And even though it’s a pull movement, upright row is really a delt movement (which in a sense means it should go on Monday). It doesn’t really matter since delts weren’t measured (and biceps is the supporting muscle rather than triceps).
And for legs, holy shit, this is a mess. So they measured quads, right, and that’s easy. Two compound exercises hit quads and if we use my standard counting that’s 2 sets for quads on G5, 4 on G10, 5 on G15 and 7 on G20. The SLDL doesn’t hit quads so it doesn’t count, and their choice to use 5 sets to describe the whole workout rather than per muscle is really confusing to me on top of just being dumb as shit.
Finally, I am at a loss as to how to count glutes in all of this. Leg press and squat certainly do hit the glutes to one degree or another but do we count SLDL? I’d count an RDL for sure. But SLDL? I just don’t know and I’m not sure anybody else does either. If you have thoughts, put ’em in the comments. Big picture it doesn’t really matter since all muscles showed the same response, honestly and you can consider my set counts below for glutes as a bit of a guess.
Seriously, this is not a well set up workout for what they are claiming to be examining in terms of set count since in no case does the described set count match the actual set count for the muscles that were being measured. Well, I suppose triceps and biceps do IF you count 1:1 for compound movements which I disagree with to begin with.
I’ve attempted to math out what I think a reasonable real-world sets per muscle group value would be in the chart below. Note that I’m using the same 0.5:1 compound-to-isolation assumption that I used in my previous series so as to be consistent. I’ll always math papers like this from here on out since only gurus change their counting when it suits them: it’s not 1:1 or 0.5:1 until it’s no longer convenient to be that.
In each of the 4 columns I’ll show their numbers (based on 1:1 counting for the muscle in question, which I think is wrong) first and my values (assuming 0.5:1 compound to isolation) second, as theirs/mine. In exactly one case they match and most of the time they do not. For lack of any more sensible way to do it, I’ll just count all three leg movements as half the number of sets for glutes. This is as much of a guess as anything and you can give my numbers the least weight of all on that muscle.
|Muscle (theirs/mine)|G5|G10|G15|G20|
|Chest|4/4|8/8|10/10|14/14|
|Triceps|5/2.5|10/5|15/7.5|20/10|
|Biceps|5/2.5|10/5|15/7.5|20/10|
|Quads|4/2|8/4|10/5|14/7|
|Glutes (essentially a guess)|5/2.5|10/5|15/7.5|20/10|
But for real this workout is a mess and even talking about this thing in terms of 5 or 10 or 15 or 20 sets is missing the point. As I said above, you can tell who didn’t read past the abstract since they won’t have noticed how bizarre the set counts actually are. Even on pecs, the only place I think it’s fair to count exercises 1:1, it’s still weekly volumes of 4, 8, 10 and 14 since military press isn’t a chest exercise (spare me the fact that maybe there’s a little pec if you lean back far enough). But it’s not 5, 10, 15, 20.
For biceps and triceps it’s 2.5, 5, 7.5 and 10. It’s 2, 4, 5 and 7 for quads and, well, glutes is a guess. In no case is it the described 5, 10, 15, 20 in the strictest sense. Yes, it’s 5, 10, 15, 20 sets per workout but it’s not 5, 10, 15, 20 sets per measured muscle group unless you stick with a dumb 1:1 counting scheme from compound chest/back movements to triceps/biceps. Which I still disagree with.
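For anyone who wants to check my arithmetic, here’s a quick sketch of that counting convention. The exercise list and per-muscle weightings are my assumptions as laid out above (a compound counts fully for its prime mover and 0.5 for the supporting muscle), not anything from the paper itself.

```python
def effective_sets(workout, weights):
    """Effective weekly sets per muscle under a weighted counting scheme.
    workout: list of (exercise, sets); weights: {exercise: {muscle: weight}}."""
    totals = {}
    for exercise, sets in workout:
        for muscle, w in weights.get(exercise, {}).items():
            totals[muscle] = totals.get(muscle, 0.0) + sets * w
    return totals

# My assumed weightings: compounds count 1.0 for the prime mover,
# 0.5 for the main supporting muscle.
weights = {
    "bench press":    {"chest": 1.0, "triceps": 0.5},
    "incline bench":  {"chest": 1.0, "triceps": 0.5},
    "military press": {"delts": 1.0, "triceps": 0.5},
}

# G5 Monday as described in the paper: 2 bench, 2 incline, 1 military
g5_monday = [("bench press", 2), ("incline bench", 2), ("military press", 1)]
print(effective_sets(g5_monday, weights))
# {'chest': 4.0, 'triceps': 2.5, 'delts': 1.0}
```

Feed it the G10/G15/G20 set counts and you get the rest of the chest and triceps numbers in the chart; note that no group ever hits its advertised “5/10/15/20” for any single measured muscle.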
And essentially nobody counts sets that way (though I seem to recall one review paper on strength gains that did something stupid like this, counted total upper body sets per week without dividing it by exercise or muscle which is just dumb as hell). Or shouldn’t. Yes, total set count per workout can be relevant to know sometimes but if you’re talking about a growth response for muscle X in response to Y sets, you don’t count all the sets in that workout if they don’t hit that muscle, right?
I mean, if I did 10 sets for chest and 10 sets for calves in a workout, I wouldn’t call that a 20 set workout if I were concerned with my weekly chest or calf volume (which is 10 and 10). But that’s sort of what they did here (yes, they grouped push pull and legs but it doesn’t stop the per workout set count from having no relationship to the per muscle set count). Saying that 4 sets of chest and 1 of shoulders is 5 sets ‘per workout’ makes no sense. Like I said, they managed to out stupid the 1:1 counting convention with this approach which is impressive in its own right I guess.
Anyhow, the above workout was done within the following undulating periodization scheme, which changed loading parameters weekly such that each was done once every month and 6 times throughout the study.
So every week the loading changed in an undulating fashion, with rest intervals varying (in what I would consider a generally rational fashion) along with it. Squats at 12-15 RM on 60 seconds are still pretty damn impossible but as Mike and I agreed, if anybody can do it, it will be women. Men have to lay down for a bit after a true max set of 15 in the squat unless the weights used are trivial. Even a TRUE max set of 12RM will leave most on the floor.
If you are a dude and disagree with this PROVE ME WRONG. Send me a video of you doing 5X8-12 true rep max sets (the last rep should be a failure rep where you get stuck or need a spotter to complete the rep or so slow that you wouldn’t get another repetition without failing halfway) in squats with a 90 second rest interval without having to drop the weight by huge amounts each set and I’ll put it on the website and say I’m wrong publicly and send you a free book. It must be one single video with no cuts or edits and I’ll time the RI on every set by hand (I’ll give you an extra 10 seconds after bringing the bar out to start the set which means bring the bar out at 90 seconds). I won’t hold my breath on receiving these. I’ve done 2 by 15RM (I mean true failure sets where the last rep is a death grinder) and it took me 3-5 minutes between them to be close to recovered. 90 seconds is impossible.
Sets of 12-15RM on 30 seconds, well…. A woman might do repeat sets of 12RM on 30 seconds, maybe, but quality is likely to suffer. Multiple sets of 15RM on 30 seconds? No way. Mind you, it was also only 6 workouts out of the 24 weeks (so unlike the entirety of Brad’s paper, only 6 workouts were low-quality junk-volume training). As well, it only gets impossible at the higher volumes of training. G5 was doing 2 sets of 12-15RM on 30-60 seconds which is doable for women. G10 was doing 4 sets of bench or squat with 12-15RM on 30-60 seconds which is potentially achievable but approaching problematic. G20 doing 7 sets of bench and squat with 12-15RM on 30-60 seconds? No way in hell.
But the 3-4 minutes for 4-6 RM, 1-2 minutes for 10-12 RM and 2-3 minutes for 6-8 RM are firmly within reason to allow quality training to be done and sufficient loads to be used on each work set. I don’t know that I think this training structure is optimal for hypertrophy or particularly representative of how most train but there ya’ have it. That’s what they did.
This certainly isn’t how I would have set up the workout to test this particular idea but they didn’t ask me. I’d have probably done multiple weeks in any rep range, perhaps within an undulating program (i.e. 3-4 weeks at 12-15RM, 3 weeks at 4-6RM, 3 weeks at 10-12RM) or whatever. Again, they didn’t ask me. The study did what it did and I’ll focus on what it actually did rather than what it might have done or should have done (or might do next time, haha…).
Irrespective of the somewhat goofy workout design, I’d like to note the following
Each muscle group was trained once a week and all sessions were supervised with a ratio of at least one supervisor to five trainees (34), by exercise specialists that were not involved in the study design.
Now exercise supervision is important simply because most people coast through workouts on their own and this is the only way to ensure something approximating equality of effort (and Brad’s papers always mention the workouts being supervised, which is one of the few things he gets right). I’ll come back to this below. But check out the last half of the design: the people administering the workouts weren’t aware of what was being tested or even studied. Presumably they were just told “Put these folks through this workout.” So they blinded the exercise specialists along with the Ultrasound tech. Once again, you can do good science if you try.
So the changes were analyzed with, so far as I can tell, only standard Frequentist methods: the whole P-value thing with the standard cutoff of P < 0.05. And just as I did with Brad’s paper, since they chose this value, they have to live and die by it. Yes, I am aware of the current paper calling for the dissolution of strict binary P values but I don’t care right now since it’s not relevant. It’s just 800 scientists (roughly zero percent of the worldwide total when you consider the millions of researchers), including Brad of course, looking for another way to get more shit science published.
I am also aware of the problematic issues with P values in this regard so spare me that criticism or the argument since it’s not relevant and I do not care. Until the P value is changed or eliminated, any study CHOOSING to use it has to live and die by that choice. Either the results were significant or they weren’t by the chosen metric. This isn’t up to interpretation or after the fact statistical bullshit. If Brad and James didn’t agree with strict P values, they shouldn’t have used one (and don’t get me started on Magnitude Based Inference which is too shitty for MSSE to even allow). Since they did, they live and die by it. They can’t have it both ways. Neither can this paper.
Mike’s Notes: So during the debate, I looked in detail at the two statistical methods Brad’s paper used, which were Frequentist (P value) and Bayesian. As Brian Bucher showed clearly and I described, the P values gave NO support for 5 sets being better than 3 and the Bayesian factors gave only weak support for 5 being better than 3, and then in only two of the four muscles. The stats were weak as hell even if James is STILL defending them with more deflections. Mike still felt this supported the strongly worded conclusion along with a basic “Well, all statistical methods have limitations and in hindsight…” and the same basic argument as for the blinding (since other papers are sloppy, we can’t judge this one for being sloppy). Yes, it probably would have been better for this paper to have used more than just P values for stats, given the limitations of that method. But just as I did for Brad’s paper, I’ll hold it to what they used in the paper and the cutoff they chose, and look at it under that light.
Because nothing else is relevant in terms of what other papers did or didn’t do or what should have been done or might be done with replication. Even if they did an additional analysis after the fact I wouldn’t care what it said. If it’s not in the published paper it doesn’t count. I don’t get to do a different analysis of this data myself if I don’t like it. The researchers picked their poison and they have to stand by the results.
On that topic let me note that Krieger is still trying to throw other analyses at the data (the newest one is equivalency testing or some such) to make it happen. First, this means that the stats in the paper clearly DO NOT support the conclusion; otherwise he wouldn’t still be trying to show that the study was good. Second, this is irrelevant after-the-fact bullshit. If he wanted to do that analysis, it should have been IN THE FUCKING PAPER. He had plenty of time to manipulate the stats during the first draft but clearly he just figured he wouldn’t get caught passing off bullshit. He did, so now he’ll just keep reanalyzing the data until he can make it work. But he had his chance and it failed. Period.
And what Barbalho et al. did was a basic Frequentist analysis with a P cutoff of <0.05. So that’s the only thing that will be examined or considered.
In analyzing the data, they state
Estimated marginal means were calculated for the change in outcome measures and within groups changes were determined by examination of the 95% confidence intervals (CI) for these. Significant change within the group was considered to have occurred if the 95% CIs for changes did not cross zero.
This is a little dense but basically they looked at the changes with 95% confidence intervals (loosely, the range expected to contain the true average change 95% of the time; note that this corresponds to the same 5% cutoff as the P value). A within-group change only counted as significant if that CI didn’t cross zero, i.e. didn’t include a no-gain result. So if the confidence interval for a group’s change included zero, they weren’t going to conclude anything from it no matter what. No group had a result that crossed zero, but this is basically reality-checking the study results and setting a further real-world limit on what they considered meaningful: the CI had to exclude zero and the ES had to be greater than 0.6. Otherwise the results weren’t considered real-world relevant or important (even IF they had a P<0.05).
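As a minimal sketch of that decision rule (my own toy implementation: it uses a normal-approximation 95% CI and treats the ES as a simple change-score Cohen’s d; the paper’s exact machinery may differ):

```python
import math

def meaningful_change(changes, es_threshold=0.6):
    """Toy version of the within-group rule as described: the 95% CI for
    the mean change must not cross zero AND the effect size must exceed
    the threshold (0.6 per the paper). Normal-approximation CI for
    simplicity; illustrative only."""
    n = len(changes)
    mean = sum(changes) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in changes) / (n - 1))
    sem = sd / math.sqrt(n)
    lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
    es = mean / sd  # Cohen's d for a pre/post change score
    excludes_zero = lo > 0 or hi < 0
    return excludes_zero and es > es_threshold

# Hypothetical change scores (kg or mm): a clearly real gain vs. noise
print(meaningful_change([10, 12, 14, 11, 13, 12, 10, 14, 13, 11]))  # True
print(meaningful_change([-1, 1, 0, 2, -2, 1, -1, 0, 2, -2]))        # False
```

The point of the double gate is that statistical significance alone isn’t enough: a tiny but consistent change can clear P<0.05, so the ES floor is what keeps trivial effects from being called meaningful.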
For better visualization of the data, they used what is called a multi-paired estimation plot which comes from something called estimation statistics. As you’ll see it shows each individual response along with the group mean and other data which just makes it easier to see what happened compared to other older methods. More studies should use this since it clearly shows both the individual and group results with the 95% CI’s.
And this is where the results are, I will admit, surprising. First I’m just going to look at the results and the analysis before trying to explain them in any meaningful way or address why they do seem contradictory to basically everything that we think we know.
First the strength data. For all exercises tested (bench press, lat pulldown, leg press, and SLDL) there were significant differences between groups and they did pair-wise comparisons on all exercises (i.e. 5 to 10 sets, 5 to 15 sets, 10 to 15 sets, etc). I won’t bore you with all of the inter-group differences since it’s a lot of “this is bigger than this but equal to that or the other” and it gets super dense and pointlessly complicated to repeat it over and over. And this is even more true since the overall picture for all tested exercises was basically identical irrespective of some minor differences. Read the paper if you really care.
Specifically, G5 and G10 were not statistically different from one another but, in many cases, either one or both were GREATER than G15 and G20 (in some cases G10 was the same as G15 statistically but G20 was still lower and, trust me, this doesn’t matter big picture). Put differently, by the statistics they chose, G5 did just as well as G10 (or G10 did no better than G5 depending on how you want to look at it) but both usually gave a superior response to the G15 and G20 groups. More volume beyond ’10’ sets not only didn’t give more gains but often gave LESS. You can see this in Figure 1 below.
The above is the multi-paired estimation plot I mentioned and you can see that it shows the individual response of every subject (the lines showing pre and post 10RM strength) along with the average response (black dot) and confidence intervals (vertical bars). It makes it really easy to not only see the changes but compare the average changes in strength.
And it’s fairly clear to see that while everybody made gains, G5 and G10 were not noticeably different and both were generally superior to G15 and G20 (it’s really easy to see on leg press in the bottom left corner). Once again, not only were the two lowest volumes not significantly different, the two higher volumes were generally worse. So looking at bench, the changes were roughly 12.5 kg, 13.5 kg, 11.5 kg and 5 kg for G5, G10, G15 and G20 respectively. For all practical purposes G5, G10 and G15 were the same (11.5-13.5 kg) but G20 gained less than half of any of them.
And inasmuch as strength gains and muscle size gains tend to be strongly correlated (when the growth is actually myofibrillar), it will come as no surprise that the changes in muscle thickness followed basically an identical pattern. Just as with the strength gains, there were statistically significant group differences so everything went to pair-wise comparison.
And again I’ll spare you the overall details since the changes were essentially identical to the strength data: G5 and G10 were not statistically significantly different from one another although both were generally superior to G15 and G20 (sometimes G15 was the same as G10 and sometimes it wasn’t but it doesn’t matter big picture). A moderate volume was no better than a low volume but the two higher volume groups had a generally worse response. This is figure 2 with the same kind of data plot as above.
Once again, you can see the individual responses along with the mean response and 95% confidence interval. And visually, as with the statistics, G5 and G10 are not meaningfully different but G15 and G20 gave a worse response. Now if we took the set count at face value (and the folks who only read the abstract did this), that means that 5 sets per week was sufficient for growth with 10 being no better and 15 and 20 being worse. And this seems hard to believe at face value given what we know or think we know about women and training (which is that generally they need more volume than men). A mere 5 sets per week for a maximal response?
But it’s even more surprising when you remember that the 5,10,15 and 20 set counts aren’t right to begin with. If you use a 1:1 counting method, only biceps and triceps achieve those values while pecs and quads hit 4,8,10 and 14 (and glutes, well….?) Which means that 4 sets for chest and quads gives the same growth and strength response as 8 sets (and 10 and 14 give a worse response). And that 5 sets for biceps and triceps gives the same response as 10 sets with 15 and 20 being worse. If you count the arms rationally as 0.5:1, the set counts drop to 2.5, 5, 7.5 and 10 with 2.5 sets giving the same response as 5 and 7.5/10 having a worse response. These are low volumes of training to give that type of maximal response. A mere 2.5 sets for biceps per week with even 5 being no better?
Given what we know about women, what most believe about training women, what I believe about training women, this is truly contradictory. Hence Mike’s deflection of “Does Barbalho not also contradict you???? Then why have you not also taken it apart?” (mind you, it contradicts ALL OF THEIR high volume prattling and I haven’t seen them addressing it in detail as to how they can ignore it). It’s still irrelevant in the context of the debate but here we are. Bored yet?
Discussion: Part 1
Ok, so what’s going on? First let me look at what the researchers themselves had to say about these results along with their examination of other studies on the topic. At the start of the discussion, the researchers write this:
To the best of our knowledge, this is the first study to compare different RT volumes in trained women for a relatively long period (24 weeks) and our results suggest that five sets per week might be adequate to promote optimal adaptations in terms of muscle size and performance in most outcomes. Moreover, our results suggest that increasing training volume beyond 10 sets per week might be detrimental to muscle performance and hypertrophy
Which is consistent with the study, the data, the statistical analysis and the results. They took their data and statistics at face value and reported what they found. But I’d make another point: pay attention to the language. “To the best of our knowledge”….”results suggest”…”might be adequate”…. See all of those qualifying words? How the conclusion is stated in a suggestive and guarded way which is how actual scientists talk. It’s also interesting in that it is not exactly what it says in the abstract which says:
Five to 10 sets per week might be sufficient for attaining gains in muscle size and strength in trained women during a 24-week RT program.
But this seems to be a common problem lately, where the abstract says something subtly different from the paper itself. It’s always a good way to see who read the paper, mind you: did they repeat what the abstract said or what the paper actually found? Though note still the use of qualified language and the word “might”. But statistically (and in absolute terms), 5 sets was as good as 10 so it’s weird that they would even include 10 sets in the abstract.
I can’t speak to their logic but presumably they were comparing 5 and 10 (which were identical) to 15 and 20 which were generally poorer. That’s my guess. But it does contradict the paper’s actual results where 10 was no better than 5 statistically. Point against the paper but only a real issue for the many who never get past the abstract to begin with.
Now compare that to the conclusion from Brad’s study:
“Alternatively, we show how that increases in muscle hypertrophy follow a dose-response relationship, with increasingly greater gains achieved with higher training volumes. Thus, those seeking to maximize muscular growth need to allot a greater amount of weekly time to achieve this goal. “
What words do you not see? Suggest, might, best of our knowledge or any word that might qualify the conclusion in any form or fashion. This was a study where the statistics flatly did NOT support that conclusion and yes I am going to beat this fucking dead horse into the ground. The P value offered NO support for the conclusions and the BF10 values were weak/anecdotal/not worth a bare mention for ONLY 2 of 4 muscles and yet Brad is talking in damn near absolutes, just like he and James both did online (and Mike HILARIOUSLY defended as NOT BEING A STRONG conclusion somehow). They didn’t even qualify that the weak-ass support was ONLY for the lower body in their conclusion. They make it sound like it applies across the board in training.
“We show that”…”increasingly greater gains….those seeking ….need to allot.” He is claiming, against his own statistics, that they showed that more volume meant more growth and if you want maximum muscle growth, you need to put in more time. There is not a guarded or qualifying word in there and I give 3/5ths of a shit what they wrote after they got caught out on it. What they wrote in the paper is bullshit: they didn’t show any such thing. How they announced it online initially was bullshit: the paper didn’t show what they claimed it showed. Mike’s attempts to call this not a strong conclusion because it didn’t give specific set counts are bullshit.
And here’s a protip: REAL scientists don’t talk like that, in absolutes without qualification. Science is suggestive as Mike so helpfully pointed out. And yet even in his papers, Brad oversells his results (it’s worse online and James does it too). And in this case, results that his data and statistics didn’t support to begin with. But I’m off topic again.
My point is that Barbalho, in addition to doing an exponentially more rigorous study actually drew a guarded conclusion consistent with the data and their own statistics. Brad really should do an internship with those guys and learn how to do fucking science properly AND write up his papers. Learn how to blind, get enough subjects, wait long enough to do the Ultrasound (and NOT do it himself), actually make guarded conclusions that match your statistical analysis. You know, do good science rather than churn out shit and get butthurt when he gets called out on it.
At best, AT BEST, Brad might have concluded “Our data very weakly suggests that, for the lower body only, a higher volume of training may lead to a slightly greater growth response.” Because that’s all the data and statistics could have possibly supported and I’m being generous in that since a BF10 of 3 still means jack shit. Again, I offer my writing services to help him with this since he’s clearly incapable of doing it himself.
But beyond that, what is going on?
Discussion: Part 2
First in the discussion, they compare the study’s results to other work. They repeat the results of Hass (comparing 3-9 sets with no difference) mentioning also that the dropout rate in the higher volume group was 25% compared to zero for the lower volume. Remember that an important issue for many people is time requirements and if you can get the same gains with less, you may be more likely to keep training. This is rarely an issue for driven athletes but is a huge issue for the general public.
They also mention a paper by Rhea where recreationally trained men did either 1 or 3 sets of bench and leg press three times per week for 12 weeks (so 3 or 9 sets per week again). It found no difference for bench but leg press did improve more with the higher volumes. This is consistent with the idea and some data that lower body may need or respond to higher volumes. This needs more systematic study.
Next up is Ostrowski which they got both wrong and right. They state that there were three groups performing 3, 6 or 12 sets per week with the changes in strength and muscle thickness being identical (and statistically remember that this WAS true). But only legs was 3, 6 and 12 sets while triceps was 7, 14 and 28 so what they wrote is not strictly correct. I am at a loss as to why I seem to be the only one who can describe this paper correctly in either its methods or results.
It doesn’t actually matter in any case since, as the statistics show, there were no differences between the lower volumes (accurately, 3 and 7 sets) and the higher volumes, which is how they reported it. They didn’t try to change the conclusion that Ostrowski actually made since the actual results matched theirs. They reported it as there being NO BENEFIT to higher volumes, which is what it actually said, rather than parsing/reversing it to say that the highest volumes gave the best results.
So this was similar to their results (5 sets giving the maximal response that occurred). Statistically, the lowest volumes worked just as well as the higher volumes even if there was a visible trend in the raw data (remember, if you choose a P<0.05 you have to live with it). They do mention that in the Ostrowski paper there was a negative change in the testosterone/cortisol ratio in the highest volume groups, which is at least one potential indicator of overtraining and might explain their results for G15 and G20. The workload was simply too much with that many sets taken to failure. I’ll come back to this.
In this vein they also mention a rat study where 3-5 sets provided the maximal stimulus to muscle protein synthesis with no further increase at 10-20 sets. But rats are not humans and I don’t pay much attention to animal studies so this is at best interesting and at worst meaningless. Until it’s replicated in humanoids, I just don’t care about animal research and refuse to conveniently cite it when it suits me.
Certainly it is likely that there is some per workout threshold above which more sets don’t stimulate growth but there is almost no human data (outside of a study by Burd comparing 1 and 3 sets where 3 sets was superior which tells us nothing about higher per workout volumes). And this is also data we need. I’m told (and mentioned this previously) that James Krieger has done a new analysis showing that 7-10 sets per workout is about the maximum (with some range) before no greater stimulus occurs. Which is interesting because, if correct, would contradict the results of THE PAPER THAT HE AND BRAD CONTINUE TO DEFEND (and I’d be amused to see James rationalize that contradiction). That said, at this point I’m not sure I’d trust him to analyze my grocery list since he’d just change statistical methods after the fact when he didn’t like the results.
They also mention the 6 week long GVT study which found that lower volumes were as if not more effective than higher volumes with some suggestion that excessive volumes may do more harm than good. They didn’t mention the 12-week followup but I don’t know why. Since the results were identical in both studies with higher volume having no benefit, it kind of doesn’t matter. Both were in general agreement with their results.
In this vein, they cite the old classic Wernbom review which did show that 40-70 reps per workout or so gave the optimal per workout response with both lower AND HIGHER reps giving a worse response. As I’ve shown repeatedly, at 8-12 reps per set 40-70 reps is 4-9 sets per workout or so. They found 5 sets gave the best response, admittedly within changing repetition ranges per week, which is at least roughly consistent with this.
Note that this is PER WORKOUT volume and I will come back to this below.
They next mention that their results do contradict meta-analyses on the topic, citing the Schoenfeld et. al. paper along with Ralston (I think this is the paper that used total upper body set count rather than per muscle or per exercise which is asinine) suggesting a dose-response relationship up to 10 sets. But they also state:
However, the use of meta-analysis for determining RT dose has been questioned due to the large number of variables involved in RT and the methodological inconsistencies in the literature.
Citing Gentil and Arruda. Now Gentil is on the Arruda paper along with the paper I’m over-analyzing now and from memory his group (along with Steele and Fisher) has been on the opposite side of this and many other issues from Brad et. al for quite some time. For example, in response to the Arruda letter to the editor, Brad and his crew submitted this: “The dose-response relationship between resistance training volume and muscle hypertrophy: are there really still any doubts?”
Mike note: One of Mike’s ‘defenses’ of Brad’s doing the Ultrasound unblinded is that Brad isn’t biased (which is hilarious because everyone has bias) because he ended up contradicting his own beliefs about volume and was ‘surprised’ by his (non) results in the recent paper. And yet check the title of that letter to the editor. Brad has NO DOUBT about the dose response relationship. NO DOUBT. That’s fucking bias, Mike. Scientists always have doubt. Brad has none. He’s clearly just setting out to prove what there is NO DOUBT IN HIS mind is true: there is a dose-response with volume and hypertrophy.
Why else would he crow about and defend a paper whose statistics didn’t support his conclusion, with guru tactics he would never let anybody get away with? Because he’s biased, Mike. That’s why. Nobody is unbiased (the Fair Witness only exists in fiction, look it up nerds) and anybody claiming that they or anybody else is unbiased is lying to themselves and everybody else. Brad is biased, James is biased, you’re biased and I’m biased and so is every other human being on the planet. Yet seemingly I’m the only one honest enough to admit that I’m biased or attempt to address my own biases (at least from time to time).
My point being that these groups are certainly at odds with one another and probably both make good points. Go read the papers for some boring tedious debate. One point being that the results of two meta-analyses contradict Barbalho’s paper. The other being that Brad has NO DOUBT about the relationship of volume and growth.
At which point Barbalho examined their results within the context of Brad’s volume paper which, like Ostrowski, they get a bit wrong. They state that for triceps the increase for 30 sets was higher than 6 sets which is wrong (it didn’t even reach pair-wise comparison by P value or BF10). They also state that VL and RF were higher for 45 vs. 9 sets which is also not statistically supported by P value or BF10 (the moderate group at 27 sets was better than the lower set group but the 45 set group was no better than 27 sets).
I can’t explain this error except to say that maybe they understand Bayes Factors even less than James (who is still trying to make a value less than 3 not be meaningless, again with a boldfaced lie about there being sources, which he has yet to provide, saying 3 is not meaningless). But it also doesn’t really matter since they weren’t reporting this in such a way to make it agree with them (as Brad did in fact do in misreporting the Ostrowski data). They absolutely incorrectly reported it as disagreeing with them based on the highest to lowest volumes but even if they had correctly reported that the moderate volumes were superior to the lowest volumes it would still contradict them since Brad’s moderate set count was so much higher than theirs to begin with. It was a mistake but changes nothing about the reporting which was that the results disagreed with theirs.
Which is a long way of saying that their mistake is not like Brad’s ‘misrepresentation’ which factually reversed the actual results of the paper from disagreeing with his to agreeing with it.
Discussion: Part 3
Which leads into potential examinations of why. The first factor they examine, which I’ll come back to was the difference in training frequency, stating:
Thus, the spreading of such extreme volumes over multiple sessions may yield benefits whereas the completion of such volumes within single sessions may not.
And within the context of the Wernbom analysis and James’s supposed analysis, this is important. If roughly 5-7 sets per workout (or even 10) gives the maximum response, doing 15-20 in a single workout isn’t useful, the extra is just junk volume which is, at best, neutral to gains and, at worst, detrimental. If you consider that Brad’s moderate volume groups did just as well as the high and take those 27 lower body sets across 3 days/week that’s 9 sets/workout. If you rationally math it from compound to quads, the leg volume drops to 18 sets/week and 6 sets/workout. Hmm, right within range of what might be an optimal per workout volume. How very odd.
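The remathing above is trivial arithmetic, but spelling it out makes the point obvious (a sketch only; the 27 sets/week figure is Brad's moderate lower body group and the 18 sets/week is my quad-only adjustment from earlier in this series):

```python
# Per-workout volume implied by a weekly set count and a training frequency.
def sets_per_workout(weekly_sets: float, days_per_week: int) -> float:
    return weekly_sets / days_per_week

# Moderate lower body volume counted across compounds: 27 sets/week over 3 days.
print(sets_per_workout(27, 3))  # 9.0 sets/workout
# Counting only the quad-dominant work (my remath: 18 sets/week):
print(sets_per_workout(18, 3))  # 6.0 sets/workout
```

Both numbers land right in the 5-10 sets/workout window that the per-workout threshold idea predicts.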
They also discuss that while their subjects were used to training to failure, the majority of subjects in Brad’s study did NOT routinely train to failure. Mind you, I still do not buy for a second that any of Brad’s subjects did 5X8-12RM in the squat on a short rest interval with anything but trivial weights and I guarantee that 99% of the folks reading this have NEVER squatted to failure deliberately but no matter (as I said above, feel free to prove me wrong with video).
They mention that many stop due to discomfort before failure, especially for lower body and when you’re not training to limits, perhaps more volume is needed. I won’t disagree and certainly I’d give someone more volume stopping 2-3 reps short of failure than going to failure. Not a billion more sets to compensate but a few more to be sure. The 3 sets to true limits might need 5-6 with a rep or two in the tank.
The researchers then repeat the conclusion which is this:
In conclusion, the present results suggest that as little as five sets per week might be sufficient for attaining optimal gains in muscle strength and size in trained women during a 24- week RT program, at least when all sets are closely supervised and performed to muscle failure [my emphasis]. Since lack of time is a commonly cited barrier to exercise adoption (50, 51), our data supports training programs that are uncomplicated and time efficient.
Noting again the use of qualified language: suggest, might, supports. And reaching a conclusion that the statistics supported as correct. This is how you both do good science and write about it.
So what’s going on here? Why does this one study on “trained” women (I’ll explain the quote marks in a moment) seem to truly contradict what most in the field believe about women’s training? Is it truly possible that this low volume per week could be enough for maximal gains in strength and muscle mass in “trained women”? Are we all stupid and wrong? Well the results are certainly the results and I won’t dismiss them out of hand. But I think there are possible explanations for the results that allow us all (including myself) to avoid the implications of this study. Because isn’t that what it’s all about, given that we’re all biased to one degree or another?
Low Training Frequency
The researchers themselves brought this one up. In fact, even before I read the paper in full, it was the first thought I had when I saw the paper and its results: that doing all the volume on one day, especially at the higher end, might be masking the true results from different weekly volumes. Let me explain.
Let’s assume that there is in fact some optimal per-workout training volume above which no further benefit occurs. There is at least some data to support the idea. Wernbom identified it ages ago, there is the supposed analysis by James, and some other semi-supporting work (even if we need far more specific research). Even most of the studies which absolutely support moderate weekly volumes point to the same thing: there is a per workout volume above which there are no further changes (except maybe in body water).
And this would make complete logical sense although it’s usually dangerous to try to figure out physiology with logic. But we know that in other situations (such as stimulation of bone with impact loading), there is a point above which the tissue becomes refractory to further stimulation. Above a certain volume of endurance training or HIIT, no further benefit is seen. It’s hard to see how muscle would NOT act that way: above some optimal amount of contractions, there is no further stimulus to growth, with the effect perhaps eventually becoming negative at some point.
The Pilot Study we Need and Deserve
I really want to see someone do a pilot study on this. Again, I don’t ‘do science’ but this seems trivial in premise even if it’s very complex in practice. Take a bunch of trainees. Have them train one arm or leg with 3 sets, the other with 6, on one day. Bring ’em back a week later and have them do 9 and 12. Maybe go 15 and 18 just to see. Measure protein synthesis each time. Ideally measure protein breakdown (which is a lot harder). See where the changes occur in both measures for different volumes.
Maybe you see that protein synthesis goes up to 9 sets/workout but above that you get no further increase in synthesis (a plateau effect) but a big increase in breakdown. Or at 12 you’re getting closer to a limit or balance. And at 15 it switches from an optimal stimulus to excessive breakdown so the net effect is now lower than at 9 or 12 sets or whatever it turns out to be. Say it turns out to be 8 sets per workout on average where you hit that cutoff point. Start THERE when you set up the next set of studies rather than picking this asinine sets/exercise approach and letting the volumes fall across an order of magnitude difference. I’m not saying this is cheap or easy methodologically but the overall design and point seems simple as hell to me. And we need this data.
Back to Barbalho et. al
But let’s start with this assumption, that there is a maximum/optimum per workout volume, which I think is a safe one. Now we combine that with the generally accepted (and research supported) fact that women do recover more quickly than men (in general). Yes, fine, Brad might have shown that men get the same growth doing the same volume on one vs. multiple days (for lower volumes of training and assuming those papers aren’t as methodologically shit as his recent one and I really should look at it again) but we can’t de facto apply that to women.
If recovery is faster in women, you may need more frequent training to maximize growth. I’d note that Barbalho et. al do mention that recovery between workouts takes 4 days but even that allows for a 2X/week training frequency per muscle group with a split routine.
And what I suspect is that if the researchers had split the volume per muscle group (kind of) across two days, the results might have been different (you’ll note that I too use guarded language, unlike Brad). That is, 5 sets done twice weekly would be different (and I suspect superior) to 10 sets done once per week. And 10 sets twice/week might (or might not) be even better than that. Clearly in this study, more than 10 sets/workout generated inferior results so it would seem pointless to study per workout volumes above that. If you want to test more than 20 sets/week, you’d have to do it across three days/week (i.e. 2, 5 and 7 sets/workout 3X/week) or something to compare 6, 15 and 21 sets/week or whatever.
So next study they do, on top of picking a program that is not completely fucked in terms of structure and set count, compare 3, 5, 7.5 or 10 sets done twice a week to get the same 6, 10, 15 and 20 sets per week. Same volume but distributed differently/at a higher frequency. Or go full body 3X/week and compare 2, 3, 5 and 6 sets per workout to get 6, 9, 15 and 18 sets per week. Again, I suspect that the results would be different with a more distributed volume and I think most would agree with me. I am also prepared to be wrong about this though I doubt I would be. Until we have the data, it’s speculation only.
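The two redesigns above are just the study's weekly targets divided across more days; spelled out (a sketch, using the set counts from the paragraph):

```python
# Roughly the study's weekly volumes, redistributed across more training days.
weekly_targets = [6, 10, 15, 20]

# Twice weekly: exact split (the 7.5 sets/workout could alternate 7 and 8).
twice = [w / 2 for w in weekly_targets]    # [3.0, 5.0, 7.5, 10.0] sets/workout

# Three times weekly: whole sets only, which nudges the weekly totals a bit.
thrice = [w // 3 for w in weekly_targets]  # [2, 3, 5, 6] sets/workout
weekly_3x = [s * 3 for s in thrice]        # [6, 9, 15, 18] sets/week
print(twice, thrice, weekly_3x)
```

Either version keeps every group at or under the ~10 sets/workout that this study suggests is the useful ceiling, while still spanning the same weekly range.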
But what’s the second reason I think the results are what they are?
Were the Women Truly Trained?
Way up above, I mentioned that the women were selected on having at least 3 years of consistent training experience, reporting that they did 18-24 sets/week across one or two weekly workouts. This is not fundamentally different than the subjects in the studies done on men in terms of the duration of training where training age tends to be in the 1-4 year range or so.
But I strongly question if this is true and that they are actually “trained” and I will base it on the same type of analysis I did when I looked at the men’s studies: relative strength levels. This isn’t perfect but gives a generally decent idea of how “trained” someone is (and yes I realize that if you don’t practice low reps your 1RM might not be as high as you’d expect for your training age).
If you go back and look at Figure 1, the average 10RM starting bench press ranged from 20-30 kg which is 44-66 lbs. If we take 10RM to be about 75% of 1RM (and this might be slightly different for women but the differences won’t be huge), this predicts a 1RM bench of 60-88 lbs. We all know that women struggle with bench but this is incredibly low after 3 years of training. These were ~64 kg women (about 140 lbs) and this is a 0.5-0.6 bodyweight bench. By most online strength standards (for example), this is a novice bench press value. Not a beginner but not well trained by any stretch.
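The conversion I'm doing there is simple enough to show directly (a sketch; the 75% figure is the estimate used above and 2.2 is the usual kg-to-lb rounding, so the low end comes out ~59 lbs before rounding up to 60):

```python
KG_TO_LB = 2.2  # rounded; 2.2046 more exactly

def est_1rm_lb(ten_rm_kg: float, ten_rm_pct: float = 0.75) -> float:
    """Estimate a 1RM in lbs from a 10RM in kg, taking the 10RM as ~75% of 1RM."""
    return ten_rm_kg * KG_TO_LB / ten_rm_pct

# The reported 20-30 kg bench 10RM range:
low, high = est_1rm_lb(20), est_1rm_lb(30)
print(round(low), round(high))       # 59 88
print(round(high / 140, 2))          # 0.63: ~0.6x bodyweight at the high end
```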
It’s not really meaningful to examine the lat pulldown (15-30kg 10RM) or leg press (70-80 kg 10RM which is 154-176 lbs) since machines differ and it would have been nice to know their squat numbers but it wasn’t tested. But the SLDL numbers were in the 30-40 kg range (66-88 lbs) for a predicted 1RM of 88-117 lbs, a value I’d also consider to be very low for women of that bodyweight. 88 lbs after 3 years of progressive training is low.
Perhaps of more importance are the actual strength gains that occurred over the 6 months of the study. Look up at Figure 1 again. On the bench press, the initial 10RM is 20-30 kg, let’s call it 25kg. And the strength gain over 6 months was an average of 12.5 kg, a 50% improvement. The same holds for SLDL where the average gain for the 5 set group was just over 20kg or half of where they started. The leg press started at 70-80kg and the average gain for G5 was 40 kg. Lat pulldown was a 12 kg gain on a starting poundage of 15-25kg. So it was a 50% gain across the board on all 4 tested exercises. In 6 months.
So we ask, would you expect women who were actually well trained to make those kinds of strength gains? Hell, who among us who is trained wouldn’t kill for a 50% improvement in 1RM or 10RM strength in any time frame, much less 6 months? And that just doesn’t happen after 3 years of consistent training. The bench press improvements actually take the women out of the novice category into the intermediate with an average estimated 1RM of 110 lbs (37.5 kg * 2.2 = 82.5 lb 10RM / 0.75 = 110 lbs 1RM which is about 0.8xBW). So in 3 years they went from untrained to novice and in 6 months reached intermediate levels of training.
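A quick sketch of that arithmetic, using the midpoint numbers read off Figure 1 and the same 10RM-as-75%-of-1RM estimate as above:

```python
KG_TO_LB = 2.2

def pct_gain(start: float, gain: float) -> float:
    """Strength gain expressed as a percentage of the starting load."""
    return 100 * gain / start

print(pct_gain(25, 12.5))  # bench: 25 kg start, 12.5 kg gain -> 50.0 (%)
print(pct_gain(40, 20))    # SLDL: 40 kg start, ~20 kg gain -> 50.0 (%)

# Post-study bench 10RM of 37.5 kg as an estimated 1RM:
est_1rm_lb = 37.5 * KG_TO_LB / 0.75
print(round(est_1rm_lb))   # 110 lbs, ~0.8x of a ~140 lb bodyweight
```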
Now, there could be other things going on. The training protocol, goofy as it was, used some low rep work and this is not necessarily a common approach for women trainees. By bumping up their maximum strength with those lower repetition training ranges, they might have made some big improvements in their 10RM. Unfortunately, no more information was given about the women’s training except weekly set count and frequency. It would not surprise me if this was their first exposure to low repetition work. But that’s a guess.
Originally I was going to argue that the women had spent 3 years simply faffing about (it’s a British term, look it up) and just doing a lot of work without accomplishing a whole hell of a lot. Studies routinely find that women often self-select loads well below what is recommended for making gains. Mind you, men do it too, self-selecting weights that are too light to be meaningful. This is why exercise studies use supervisors: otherwise people just fuck around intensity wise.
Being in a study, being supervised and being pushed to limits with low repetition work for possibly the first time would also explain a 50% improvement in 6 months over what they had achieved in 3 full years of training. I’m not bothering to find the paper, but simply having a trainer present (standing next to you) tends to make people work harder (a hilarious recent paper found that men were more likely to be impacted by this while women were not). Being supervised to work hard/to limits for, I would contend, the first time ever, can drive enormous gains.
The researchers even address this in the formal conclusion of the paper where they write:
In conclusion, the present results suggest that as little as five sets per week might be sufficient for attaining optimal gains in muscle strength and size in trained women during a 24- week RT program, at least when all sets are closely supervised and performed to muscle failure (my emphasis).
Basically if you get worked to limits, low volumes may be sufficient and my initial assumption based on the description was that the women were not used to working that hard but that may be wrong. In a throwaway sentence I almost missed in the discussion, they state that the women had “experience of such training [to failure]”. Which isn’t very helpful in terms of what they did, or how often or whatever. But it suggests that they were not in fact just faffing about in the gym the whole time. But I’m still not sure I buy it at face value.
I’ve watched too many women train for too many years and the number going that hard on a consistent basis are in the minority (shit, I’ve watched too many men train to believe the claims of everyone on the Internet that they work harder than any 10 trainees combined). I certainly doubt that they got 40 women who had trained anywhere near their limits for this study. And I think both the initial strength levels and enormous strength gains over the 6 months (indirectly) supports that contention. What most trainees think are their limits aren’t even close. And under supervision (and low reps to limits), these women made staggering gains in 6 months compared to what they had achieved in 3 years to that point.
But it may simply have been that the reason the low volumes were sufficient was two fold. First and foremost, the women simply weren’t that well trained to begin with. I mean, you can be in the gym for 3 years farting around (whether male or female) and not be well trained. The initial strength levels kind of speak to that: women training hard for 3 years should be lifting heavier weights than they were even for sets of 10. And then when you start working hard for the first time, you make serious gains: I mean, 50% gains in 6 months. That’s huge.
But we know that the less well trained you are, the more rapid your gains when you start pushing hard. No, these women weren’t untrained but their numbers on bench were distinctly novice. And relatively untrained individuals need very little volume to progress. The Hass study they cite found exactly that: women with 1 year of training to failure still made big gains with only 1 set taken to true failure. True failure being the key. Again, if you keep a few reps in reserve, you need more sets and I won’t argue that at all.
And I think this also explains why the higher volumes were at best no better or, at worse, err worse. When you’re at best moderately trained and not used to pushing hard, high volumes of failure work will overwhelm your ability to recover. Even with my set count remathing, the volumes of truly heavy work (especially on one day of training) were pretty high for the higher volume groups.
So, during the lowest rep week, consider that G5 would have done 4 sets of 4-6RM for chest which is a pretty good training load. As I’ve pointed out before, 5X5 was created by Bill Starr averaging out an early study where 4-6 sets of 4-6RM gave the best strength gains. At G10 you’re up at 8 sets of 4-6RM which is heavy but doable. At G15 and G20 you’re at 10 and 14 sets of 4-6RM for chest in that workout respectively which is a staggering workload. Most advanced trainees couldn’t survive that.
And I honestly think that provides a much more logical explanation of what was going on here. The women had ‘trained’ for 3 years but weren’t even remotely well trained based on their strength levels/strength gains. And when you are being pushed to failure under those conditions, it simply doesn’t take a high volume to get absolutely staggering strength gains (which they got) in that population which explains why even G10 didn’t get much better results. And too much volume at that level of intensity will simply overwhelm anybody. And inasmuch as progressive strength gain over time drives growth (since it is and will always be the PRIMARY driver of growth), the same low volumes, driving up moderate rep strength, gave similar growth results.
I’d still like to see a followup with a split volume. 10+ sets of 4-6RM once/week on bench is a staggering workload but I bet 5 sets of 4-6RM twice a week is survivable and would generate better results. It’s the same 10 sets/week but distributed far more rationally. The 14 sets of 4-6RM in G20 at 7 sets twice/week is a heavy but achievable workload although 7 vs. 5 sets isn’t much of a difference. But those numbers would be completely consistent with meta-analyses of volume and strength gain (admittedly mostly done on men) such as the one by Rhea along with real world training practices.
Once again, I could be absolutely completely wrong about the above and when more research in trained women comes out, the model will be further developed. When and if that happens, I’ll make a point to correct myself. Let me note in that regard: I seem to be the only one in this entire fucking industry even willing to consider that he might be wrong. And do so ahead of the fact. Let’s see a single guru do that. Just one.
Should we Change Everything We Do with Women?
Which, to answer the question that Mike would have asked me and someone will ask: Will I now be changing how I approach training women? Well nope on two counts. In the first place, I’m all about lower volume training for relatively less trained individuals which I think these women were. You don’t need that much to progress when you’re a novice. So my novice recommendations don’t change. A few heavy sets twice per week close to or to failure is more than plenty at that level of training and I think far more trainees would benefit from doing less faffing about and a lower volume of challenging work. It works better than doing all the volume when it’s junk.
So far as more well trained women, well I could argue that this paper doesn’t apply to them in the first place since I don’t think these women were particularly well trained (perhaps this represents my bias but if you can show me a truly trained woman who can make 50% gains in 6 months, please do). And I’d stand by that to some degree.
Even if we accept these women as more than novice trainees, it’s still only the one paper (as Mike liked to harp on about Brad’s paper). Not only that, it’s the ONLY paper on the topic to date. Maybe the results are absolutely correct, maybe they will be replicated. Maybe they won’t be. Time will tell. But if Mike can play the “It’s only one paper” card to excuse why the evidence based crowd is basically ignoring Brad’s shitty results (which was ONE paper contradicted by 6 with Radaelli being a shit show), then that applies to this situation where it is the ONLY paper on the topic. He can’t have it both ways. You played yourself, Mike.
Mike has Problems with Numbers and Consistency
And before you call hypocrisy let me point out another amusing thing from the debate that many might have missed (I did until I thought about it afterwards). The argument for why all the “evidence-based” folks weren’t applying Brad’s study is that it’s only one paper. To that I’d add that the results didn’t support the claims to begin with which is a better reason to fucking ignore it. And would have been a far more logical argument. But that would entail acknowledging that the results are in no way supported by the statistics.
Mike made that argument during my half of the debate that settled nothing. That I was overvaluing one paper. Which is horseshit, I’m criticizing one paper based on the shitty methodology and a lie in the discussion. A subtle distinction but an important one. Meanwhile, James is STILL throwing new analyses at this trash fire to try and make it happen while others are being apologists for it. You want emotional attachment to a paper, that’s it.
But then if you listen carefully, when we move to the volume discussion in the second half where he’s arguing for high volumes, Mike suddenly changes to “But there are 3 out of 8 studies supporting higher volumes.” He asserts that those are Brad’s (not really, based on their own statistics), Haun (not really, unless you think doubling volume for 0.5 lbs of muscle is relevant) and Radaelli (which I maintain is garbage data unless you believe that beginners get ZERO growth over 6 months until they do 45 sets/week or whatever nonsense it supposedly found).
So which is it Mike? Is it one paper that we can ignore or 3 out of 8 that the evidence based circle-jerk is ignoring because they know Brad’s results are shit? And spare me the anecdotal bullshittery where James is getting supposedly good results from 45 sets and Menno has a woman doing 70 per week for back because he’s a shitty coach and she’s a shitty trainee. Anecdote isn’t science and never will be. Are you ignoring 1 or 3 papers? Or just changing your argument as it suits you and hoping nobody notices?
As importantly, why did Mike lower his template volumes based on ‘new research’ (i.e. his single study) if ‘it’s only one study’? I’m not expecting an answer. Just making the point. None of them can stay consistent day to day or hour to hour while I’ve been repeating the same tired bullshit for months now. Even I’m exhausted by it.
And now let’s try to wrap it up, because I think there is another possible reason that this study’s results seem so far out of whack from the others. To understand that, I need to do something I said I’d do earlier.
The Tale of the Tape: Barbalho vs. Schoenfeld
I want to start this section by collecting all the snide comment/comparisons I made about Barbalho’s methodology vs. Brad’s in a chart. US is Ultrasound and I’ve put the negative points in red.
| Factor | Barbalho | Schoenfeld |
|---|---|---|
| Sample size | Met a priori power analysis for subject number | Did not meet power analysis due to dropouts |
| Study duration | 6 months | 8 weeks |
| Blinding of US tech | Yes + uninvolved in study | Brad did it himself unblinded FFS |
| Time of US measurement | 3-5 days after last workout | 2-3 days after last workout |
| Menstrual cycle controlled | No | Males so N/A |
| Workout design | Dumb as fuck | Rational enough |
| Set counting | Dumber than Brad’s | Dumb given workout design |
| Exercise supervised | Yes, by specialists unrelated to study | Yes |
| Statistical analysis | Frequentist with ES>0.6 and no nil result cutoffs | Frequentist, Bayesian (BF10 values) that were heartily ignored |
| Conclusion wording | Guarded and qualified | Unqualified in wording |
| Conclusion | Supported by statistics | Unsupported by statistics |
You happy, Mike?
First and foremost, the Barbalho study absolutely crushed Brad’s study on every level methodologically. I mean every one. Yes, it had problems because, as Mike so helpfully pointed out, “All studies do” (another smokescreen deflection similar to what anti-science people use). They should have provided post-study anthropometrics, they should have tried to control for the menstrual cycle. I’m not arguing any of that. But in the aggregate, their paper crushed Brad’s in the factors that mattered more: duration of study, number of subjects, BLINDING the Ultrasound tech, actually drawing conclusions consistent with their results/statistics.
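To make the a priori power analysis point concrete, here’s a minimal sketch of the standard normal-approximation sample size calculation for a two-group comparison. The specific numbers (d = 0.6 to match the effect-size cutoff in the table above, α = 0.05, 80% power) are my assumptions for illustration, not taken from either paper:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sample comparison
    (normal approximation; slightly underestimates the exact t-based n)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_b = NormalDist().inv_cdf(power)          # quantile for desired power
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

# Detecting d = 0.6 at 80% power needs roughly 44 subjects per group;
# lose a handful to dropouts and you no longer meet your own power target.
print(n_per_group(0.6))
```

The point of the sketch: the number is computed BEFORE the study, and dropouts that take you below it mean the study can no longer reliably detect the effect it was designed to find.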
So again I thank Mike for challenging me to this since it simply made my point: it is possible to do good science if you try. Brad either doesn’t know how or simply doesn’t care. I guess he doesn’t have to when he has a circle jerk of gurus to defend him with rank apologism and shitty deflections (and worse appeals to authority in MASS). Perhaps if Brad would intern with this group he could learn to do good science, James wouldn’t look the fool trying to use different analyses after the fact and Mike wouldn’t have to make pathetic excuses for all of it. We get better science and less rank apologism and guru antics after the fact. And I could write about other, more interesting things.
But beyond that, I think the differences in this study (and several others) might point to why the results are so different than other studies: actually having good methodology. Because it seems that when you have short studies, small samples, piss-poor measurement methodology (unblinded Ultrasound done too soon after the last workout) you get one result, at least sometimes. And when you do the opposite, you may get the opposite result. At least sometimes. This is a difficult conclusion to reach since few studies are particularly long and none I can recall actually lived up to this study’s methodological rigor except perhaps the Haun study Mike was involved with.
It’s not always true; Ostrowski and the GVT study were both short and found that lower volumes were just as good as higher. It’s just as possible that a longer GVT study would support higher volumes. It seems just as reasonable that Brad’s paper simply found (not really statistically supported) insane results due to the combination of shit methodology it threw together. It’s possible that over a period longer than 8 weeks the high volume group would have shown a statistically relevant finding. But since the study wasn’t longer, we can’t know (and we can only analyze it BY THE STATISTICAL RESULTS IN THE PAPER).
Adding to this is Lucas Tafur’s piece linked above suggesting that studies of less than 12-16 weeks may not be measuring actual muscle growth to begin with, but simply fluid shifts (other papers have suggested this as well), which do impact US measurements (note: the Haun study Mike was involved with brings this into question as direct biopsy DID show growth over 6 weeks, although it turns out to have been sarcoplasmic proteins all along). He’s also shown that there is a great likelihood that the edema from training is not gone by the time point at which most studies measure. Add to this the unblinding issue which, given the realities of Ultrasound, increases the risk of bias, and you get a lot of factors that might be impacting this data set when the studies are short and methodological shit shows.
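As a toy illustration of why measurement timing matters, assume an Ultrasound-measured thickness gain is real growth plus residual training edema that decays exponentially over a few days. Every number here (1 mm of real growth, 3 mm of initial edema, a ~2 day decay constant) is invented purely for illustration:

```python
from math import exp

def measured_thickness_gain(days_since_last_workout,
                            true_growth_mm=1.0,
                            initial_edema_mm=3.0,
                            decay_days=2.0):
    """US-measured gain = real growth + residual edema (toy exponential model)."""
    edema = initial_edema_mm * exp(-days_since_last_workout / decay_days)
    return true_growth_mm + edema

# Measure at 2-3 days and a big chunk of the "growth" is still swelling;
# wait long enough and the measurement converges on the real change.
for day in (2, 3, 5, 14):
    print(day, round(measured_thickness_gain(day), 2))
```

Under these made-up numbers, a measurement taken 2 days post-workout roughly doubles the apparent gain, while one taken two weeks out is essentially all real growth. The shape of the argument, not the specific values, is the point.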
But realistically, longer studies with FUCKING BLINDED techs and enough time between the last workout and the final measurement are going to give us far more valid information on all of this type of thing. Yes, science is hard and this is difficult to do. Well, if you wanted an easier career, you should have picked something else. But please stop making excuses for your inability to get basic, simple stuff right.
Tangent: Brad Wants Better Methodology for Everyone Else
Somewhat hilariously, in a paper titled “A critical evaluation of the biological construct skeletal muscle hypertrophy: size matters but so does the measurement”, just published in Frontiers in Physiology with Brad’s name on it, they write:
Additionally, ultrasound is highly dependent on the skill of the investigator, given that differences in the pressure exerted by the transducer against the skin can result in substantial variation in measurements and thus high inter-rater error rates. Thus, ultrasound-based assessments of muscle thickness provide a fast and practical assessment of [1 dimensional] muscle size, but the QUALITY OF THESE ASSESSMENTS MAY VERY WELL BE RATER-DEPENDENT (emphasis mine).
Simply, Ultrasound measurements can be impacted by how the measurement is done on top of having a subjective component to begin with. Add to that an unblinded tech with a bias that more volume means more growth and well…bias + unblinding + subjective component equals what now? Biased results is what.
And beating that dead horse, here is a hilarious paper that came out a little while ago titled How to prove that your therapy is effective, even when it is not: a guideline. Which is either an attempt to make a point about the little games that researchers play to get garbage science published or a how-to guide for sloppy science. And even though it’s about psychology, well the games researchers play still apply. It states, among other amusing things:
Another weak spot you can use to influence the outcomes of your trial is to use non-blinded raters of clinical assessments of outcome…. Predictably, studies with proper blinding also resulted in smaller effect sizes for psychotherapy for depression (Cuijpers et al. 2010b).
And isn’t that interesting? When you blind studies (i.e. adhere to good scientific standards), suddenly your amazingly huge results seem to disappear because you reduce bias. And even though this is psychology, I am willing to bet it holds across all fields. Scientific guidelines exist for a reason, we blind for a reason, and not blinding is sloppy science no matter how you try to excuse it. Making excuses for it after the fact is just pathetic when you clearly know better. And exercise science shouldn’t somehow get a pass on this unless its practitioners just want to flat out admit “We’re too pitiful to get it right.” Because clearly some labs can get it right.
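The blinding point is easy to demonstrate with a toy simulation: give two groups identical true outcomes, let an unblinded rater unconsciously nudge the “favored” group’s measurements upward, and watch a null result turn into a respectable effect size. Every number here is made up for illustration:

```python
import random
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    sp = (((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2)
          / (len(a) + len(b) - 2)) ** 0.5
    return (mean(a) - mean(b)) / sp

random.seed(42)
n = 500
# Two groups with the SAME true mean change: no real effect exists.
control  = [random.gauss(1.0, 2.0) for _ in range(n)]
high_vol = [random.gauss(1.0, 2.0) for _ in range(n)]

d_blinded = cohens_d(high_vol, control)
# Unblinded rater expects more growth at higher volume: a +1 mm nudge.
d_unblinded = cohens_d([x + 1.0 for x in high_vol], control)
print(round(d_blinded, 2), round(d_unblinded, 2))
```

A constant 1 mm nudge against a 2 mm spread manufactures an effect size of about 0.5 out of literally nothing. That is the whole reason blinding exists.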
Back to the Topic
And I’m willing to bet that if you group seemingly disparate papers on this topic by methodology, length of study, when the US was done, that sort of thing, you may see a systematically different set of results, with the latter studies (longer duration, blinded US, longer time before the measurement) showing a far smaller growth response to all training volumes than the short studies with poorer methodology. The short-term changes are simply too likely to be fluid shifts, especially with super high volumes. Long-term, with a later measurement time point, this goes away and you are more likely to see actual growth. I could very well be wrong and will happily admit it if that turns out to be the case. But I doubt I will be.
So whether or not we like or believe the results of this paper, the fact is that methodologically it was superior to nearly every other study on the topic ever done. It was longer, had sufficient subjects, blinded the US tech, blinded the exercise specialists, did the final US measurement at a later time point. Sure, the training program was bonkers and the set counting logic made the 1:1 assumption look semi-coherent. But that’s more about design than methodology. In terms of how it measured what it set out to measure, it CRUSHED Brad’s paper.
It also actually drew a guarded conclusion based on what its statistics said rather than a near absolute conclusion based on what its statistics absolutely did NOT say. Something all good scientists do. And something bad scientists do not do.
I still think that the results came out of the 1X/week muscle group frequency and that splitting that weekly volume across two workouts might give different results. Add to that the enormous strength gains in supposedly “trained” women and I think that they weren’t very well trained. And we know that less trained folks don’t need much volume to progress (and the highest volumes were simply overwhelming), so I don’t think it actually contradicts what we think we know about training women. And since it’s truly “only one paper” (in fact the ONLY paper), I won’t be changing anything any more than anybody else will until there’s more work done. I can’t wait to see it, especially if it’s as rigorous as this paper was.
If it replicates this paper’s results, great. We’ll still need more data. If it contradicts this paper’s results, great. We’ll still need more data. Until a pattern shows up across multiple studies from multiple labs (and it absolutely HAS for men, where 6 of 7 papers, dismissing Radaelli, support moderate volumes as best), we can’t draw good conclusions.
There Ya’ Go, Mike
So there ya’ go, buddy. I dissected Barbalho at a level I’ve never examined any other paper. So in the future you can spare me your weak “Well you didn’t dissect other papers on your website” horseshit. It’s only a deflection but now you don’t have a pot to piss in. You and Greg and the rest will have to come up with different lame dismissals and deflections now. I have faith in you, Greg, James and Brad to do so. If you spent as much energy doing better work as you did worrying about me, we’d all have a lot more time to watch the tubes.
Genuinely, I do want to thank you for finally getting me off my butt to do it. Because, as I said, this one bit you in the ass. You’re the guy in Internet debates who tried to cite a paper at me that did nothing but support my point. Because on nearly every level, Barbalho destroys Brad’s methodologically sloppy science and that’s not debatable. Yes, this paper had problems and I mentioned those in every case: no post-study anthropometrics, not controlling for menstrual cycle timing, the workout setup and set counting were dumb as shit, the abstract not exactly reporting the paper’s results correctly, etc. But overall it made Brad’s paper look like a bad high school science project in terms of the methodological factors that really matter.
Barbalho had sufficient subjects, a long duration, blinded US techs and exercise specialists, US done later after edema had dissipated, and a guarded conclusion that actually matched both the data and the statistical analysis. Stuff that Brad and his crew are either unable or unwilling to do. But that he is happy to criticize in papers that are not his own. That you and others are happy to make lame ass apologist excuses for.
Because, as Barbalho et al. showed, it is actually possible to do good science unless, apparently, your name is Brad Schoenfeld. And when you do that, you don’t need other people to make lame duck excuses for you. Or come up with bullshit arguments after the fact. It’s a win-win for everybody including the field of science, the researchers, and the current circle jerk who doesn’t have to face down my relentlessly obsessed ass when I go on one of my fucking tears. If it’s not garbage science, you don’t have to make excuses for it is my point.
Or waste 90 fucking minutes of their life on a non-debate that accomplishes nothing when they could be watching Pornhub.
I told you this wasn’t over, Mike. And it’s still not.