What Is Wrong with the Strategy Tester in MT4 and MT5

Hello, fellow Forex traders!
Today we will continue discussing such a broad topic as algorithmic trading and talk about such an essential tool as a trading robot tester. Some of the issues covered in this article have already been touched on in the blog, but in my view they still have not been explained in enough detail. So today we will look at what backtesting is, what difficulties and pitfalls arise when testing advisors, identify the limitations of the MetaTrader terminal when testing and optimizing advisors, and find ways to minimize those limitations as much as possible.
What is a backtest

Our blog already has articles on how to test an advisor in the MT4 terminal and in the MT5 terminal, so here I will just give a definition. A backtest is the application of a trading system's rules to a set of historical market data, that's all. That is, we define a set of rules for entering the market, exiting, managing a position, and calculating its volume, and then apply these rules to historical data as if we had been trading by them at the time. In this way, we can evaluate the performance the system could have achieved in the past.
Perhaps the tester's most important function is simulation. The algorithm's author runs backtests to judge whether it is worth continuing to improve a forex advisor or whether the trading system chosen for research simply is not worth the time.
Backtesting also helps us evaluate whether a given strategy is worth using for live trading based on its past performance. Essentially, backtesting helps us weed out bad advisors before we risk real money.
In addition to the initial selection of strategies for further work with them and filtering out unsuitable strategies, the tester also helps to check the performance of the systems being created, identify errors and improve the performance of the algorithm.
Main mistakes when testing advisors

Testing an advisor is very easy, but unfortunately the test results are not the results of real trading. They are simulation results. No matter how complex a model is, it still contains assumptions and simplifications, which inevitably lead to inaccurate results. This is why testing has so many pitfalls. Below I list the main mistakes.
Testing only on in-sample data.
This is the most common beginner mistake, and the worst one. Essentially, it is curve fitting: it happens when you use the same data both to optimize and to test a strategy. Such testing always greatly overestimates the system's results, and the gap becomes visible in real trading. The reason is that the system has never been tested on other data, which will obviously differ noticeably from the data used in the test.
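One simple way to avoid this mistake is to split the history chronologically and keep part of it out of the optimization. The sketch below is only an illustration: the function name and the 70/30 split are assumptions, not a rule from this article.

```python
def split_history(bars, in_sample_percent=70):
    """Split bars chronologically: optimize on the first part, verify on the rest."""
    cut = len(bars) * in_sample_percent // 100
    return bars[:cut], bars[cut:]


# Stand-in for real quotes: 100 bars numbered in time order.
bars = list(range(100))
in_sample, out_of_sample = split_history(bars)
print(len(in_sample), len(out_of_sample))
```

Parameters are then selected on the in-sample part only, and the out-of-sample part is used once, to check whether the result survives on data the optimizer has never seen.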
Market cycle change
Markets are non-stationary and constantly changing. Therefore, it is so difficult to select suitable parameters for system operation over a long period of time. Moreover, the longer this period, the more universal the system parameters are. However, unfortunately, there are no ideal parameters of the advisor, using which the robot would trade perfectly on any market. This is one of the reasons why advisors perform worse in the real market than in tests.
Transaction costs
Many forex traders, especially beginners, do not take into account the costs associated with opening and closing positions. I often see tests run with unrealistically low spreads: for example, instead of using a 2-3 pip spread on EURUSD for the test, people use a 0.5-1 pip spread. The result is a holy-grail picture that has nothing to do with reality, because the fact that your broker, for self-promotion, advertises a 0.5-point spread somewhere does not mean that this is what you will actually get. In addition, many people forget about commissions per trade. They do not exist on all account types and not with all brokers, but as a rule, where the spread is low, the commission is there, and it is not small. You should also not forget about swaps - they also distort test results and affect the final outcome. And of course there is slippage, the scourge of all pip traders. Add all these factors together and it turns out that trading is quite an expensive activity: in the real market you will pay 0.5 points of spread + 0.7 points of commission + 0.3 points of slippage on entry and 0.3 points on exit, so the real transaction cost will be 1.8 points.
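The arithmetic above can be written down in a few lines. This is a minimal sketch using the cost figures from the example; the function name and the sample scalper (1000 trades averaging +1.5 points before costs) are hypothetical.

```python
def round_trip_cost(spread, commission, slippage_in, slippage_out):
    """Total cost of one trade in points: spread + commission + slippage both ways."""
    return spread + commission + slippage_in + slippage_out


cost = round_trip_cost(spread=0.5, commission=0.7, slippage_in=0.3, slippage_out=0.3)
print(f"cost per trade: {cost:.1f} points")

# Hypothetical scalper: 1000 trades averaging +1.5 points each before costs.
gross_per_trade = 1.5
net_total = 1000 * (gross_per_trade - cost)
print(f"net result over 1000 trades: {net_total:.0f} points")
```

A system that looks profitable when tested with the spread alone turns into a losing one once commission and slippage are included.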
Peeking into the future
Future data can be used in testing, most often due to errors in writing the algorithm, but sometimes also due to malicious intent. Just last year, on the mql5.com website in the “market” section, it was very fashionable to sell advisors that looked into the future. It is much more convenient to trade and show phenomenal trading results if you know the future. But such results, unfortunately, have nothing to do with real trading.
Market impact
When backtesting, you can easily scalp with a thousand lots and get good results. In real trading, however, such volumes will inevitably influence the market: even a 100-lot position opened in the evening or at night can move the price of popular currency pairs, as has been verified in practice. The tester does not take this influence into account.
Historical data
Many brokers offer to use their historical database. As a rule, the quotes there are not of the best quality. It is possible to find large missing chunks of historical data, especially on lower time frames. You can also find many sources of free quotes on the Internet, usually of equally dubious quality. The quality of quotes greatly influences the backtest results, so this problem should be taken as seriously as possible.
Low robustness
Some traders, especially beginners, allow systems with low robustness to trade with real money. Robustness is a system's resistance to changes in the input data. For example, if you test your system from the 3rd of this month to today, you may get a good result, but if you start the test from the 4th, the system goes into drawdown. That means the system has low robustness. Another example is when you change the period of an indicator that affects market entry from 9 to 10 and the system starts losing money. It is a big mistake to allow such systems into live trading.
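The indicator-period example can be turned into a simple check: accept a parameter value only if its neighbours are profitable too. The sketch below assumes a hypothetical `backtest(period)` function that returns profit; here it is replaced by a table of made-up numbers.

```python
def is_robust(backtest, period, neighborhood=1, min_profit=0.0):
    """Return True only if the chosen period and its neighbours are all profitable."""
    candidates = range(period - neighborhood, period + neighborhood + 1)
    return all(backtest(p) > min_profit for p in candidates)


# Toy stand-in for a real backtest: profit by indicator period (made-up numbers).
toy_results = {8: 120.0, 9: 450.0, 10: -300.0, 20: 200.0, 21: 210.0, 22: 190.0}


def toy_backtest(period):
    return toy_results[period]


print(is_robust(toy_backtest, 9))    # period 9 looks great, but 10 loses money
print(is_robust(toy_backtest, 21))   # 20, 21 and 22 are all profitable
```

The same idea applies to the start date of the test: shift it by a few days and see whether the result holds.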
Psychology
As I said in the previous article, psychology plays a much smaller role in algorithmic trading than in manual trading. Various sellers of trading systems promote the idea that when trading with a robot, you can forget about psychology completely. Ignoring psychology altogether, however, is another mistake of novice traders. There are a number of trading systems that look quite good in testing but are simply unbearable to trade live. In addition, just as many novice manual traders change their trading rules on the go, less experienced algorithmic traders often meddle with systems that are already running, trying to “tweak” them on the go. All this, of course, leads to monetary losses.
Types of testers based on operating principle

At the moment, there are many different terminals on the market for trading financial instruments. Many terminals allow you to write expert advisors for automated trading, and most of them, as a rule, have built-in testers and optimizers. They differ greatly in functionality and capabilities, but by their operating principle testers can be divided into two types: “cyclic” and “event-driven.” Each type has its own advantages and disadvantages.
Cyclic backtesters
This type of tester is the easiest to implement and the most transparent in operation. Such a tester simply steps through each bar one after another. When a new price arrives, the tester performs calculations on prices, computes indicators, and allows orders to be placed and modified. Then the next iteration begins, and this continues until the test is stopped. At the same time, the tester stores some statistical data on the advisor's performance - profit, number of trades, drawdown, and so on - and at the end produces a report on the advisor's test results. As you can see, this design is very simple: you just load historical data and move through the prices line by line, performing calculations each time. This concept will probably seem familiar to you. That's right: MetaTrader works according to exactly this scheme. Well, not entirely, because the Mql language still has some predefined events. So this platform, like many others, is really more of a mixed type.
As you probably already guessed, such backtesters are very easy to implement in almost any programming language. At the same time, they work very quickly - you can quickly test and optimize many combinations of EA parameters.
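To show how little code such a tester needs, here is a minimal sketch of the bar-by-bar loop. The strategy (long while the close is above a simple moving average), the function name, and the prices are all made up for illustration; this is not how any particular terminal is implemented.

```python
def run_cyclic_backtest(closes, sma_period=3):
    """Step through bars one by one; long while close > SMA, flat otherwise.

    Deliberately naive: fills at the bar's close, no spread, commission or slippage.
    """
    position_entry = None      # entry price of the open long, or None when flat
    equity = 0.0               # realized profit in price units
    peak = 0.0                 # highest equity seen so far
    max_drawdown = 0.0
    trades = 0

    for i in range(sma_period, len(closes)):
        sma = sum(closes[i - sma_period:i]) / sma_period   # uses only past bars
        price = closes[i]

        if position_entry is None and price > sma:
            position_entry = price                         # open long
        elif position_entry is not None and price < sma:
            equity += price - position_entry               # close long
            position_entry = None
            trades += 1
            peak = max(peak, equity)
            max_drawdown = max(max_drawdown, peak - equity)

    return {"profit": equity, "trades": trades, "max_drawdown": max_drawdown}


closes = [10, 11, 12, 13, 12, 11, 12, 14, 15, 13, 12]
print(run_cyclic_backtest(closes))
```

Everything happens inside one loop over the price series, which is why such testers are fast and easy to optimize with.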
Unfortunately, the main disadvantage of such a tester is the unrealistic nature of the resulting backtests. Quite often, such backtesters do not even take into account the spread or commission, and orders are always executed at market prices without slippage and instantly. Therefore, professionals use such testers only for an initial assessment of the effectiveness of the algorithm in order to decide whether it is worth working on the advisor further.
Event-Driven Testers
To understand more clearly how such a tester works, imagine a computer game. The player is constantly interacting with the game world, and that world is not static - something is always happening, evil creatures attack the player, and friendly characters are busy smashing each other's skulls with axes over some household quarrel. So that your computer does not choke on the sheer amount of action, smart people came up with the idea of putting all those calculations into an infinite while loop called the event loop or game loop, inside which an event queue is built. These events are continuously processed inside the loop one by one at a speed that matches your computer's power. In a tester, such events might be the appearance of new trading signals, readiness to send a message to the broker, the arrival of a new tick, or information received from the broker. When a specific event occurs, it is processed by the corresponding tester module and may generate new events, which are then added back to the event queue.
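The event loop described above can be sketched as follows. This is a toy illustration under stated assumptions: the event names (TICK, SIGNAL, ORDER, FILL), the threshold strategy, and the choice to push derived events to the front of the queue so they are handled before the next tick are all hypothetical.

```python
from collections import deque


def run_event_loop(ticks, threshold):
    """Process a queue of events; each handler may push new events back onto it."""
    events = deque(("TICK", price) for price in ticks)
    log = []
    in_position = False

    while events:
        kind, payload = events.popleft()

        if kind == "TICK":
            # Strategy module: emit a signal when price rises above the threshold.
            if not in_position and payload > threshold:
                events.appendleft(("SIGNAL", ("BUY", payload)))
        elif kind == "SIGNAL":
            # Order module: turn the signal into an order for the simulated broker.
            events.appendleft(("ORDER", payload))
        elif kind == "ORDER":
            # Broker module: confirm the fill; a realistic one would add slippage and delays.
            events.appendleft(("FILL", payload))
        elif kind == "FILL":
            # Portfolio module: record the executed trade.
            in_position = True
            log.append(payload)

    return log


print(run_event_loop([1.09, 1.11, 1.12], threshold=1.10))
```

Because each stage is a separate handler, a more realistic broker or several strategies can be plugged in without touching the loop itself.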
Among the advantages of this type of tester are a testing model that is as close to reality as possible and, as a result, very accurate tests that account for many factors arising in real trading. This architecture supports portfolios of systems and instruments, and such testers, as a rule, can test and optimize a large number of systems on various instruments.
The disadvantages of this design include the complexity of the code and, accordingly, a large field for errors. To write systems, you will need knowledge of object-oriented programming and good experience in system development. Another significant disadvantage associated with the interaction of various tester modules is the slow speed of operation. Testing, and especially optimization, can take quite a long time.
MetaTrader terminals

On the blog you can find reviews of the MetaTrader 4 terminal and even of MetaTrader 5. Unfortunately, those articles do not cover the most important thing: whether you can use these terminals to test and optimize your advisors and how much you can trust the results. Let's try to understand this issue.
As I already said, professional algorithmic traders do not use the MetaTrader terminal mainly for three reasons: insufficient flexibility of the language for writing expert advisors, low testing accuracy and unsatisfactory testing speed. MetaQuotes has created an excellent terminal - simple and convenient. This is a great starting point for getting to know the world of algorithmic trading. A lot of good ideas have been added to the platform recently, especially in the fifth version of the terminal, but, unfortunately, their implementation has greatly let us down. For example, how can people appreciate an improved tester if there is no way to import their quotes? Or why should people rewrite the code of their advisors and indicators themselves, when they could write a simple program that would do all this automatically? But first things first.
So I came up with a simple trading system whose algorithm I implemented as an mql4 advisor, an mql5 advisor, and an R-language tester/optimizer with the same built-in strategy. I did this in order to run tests and optimizations and compare the speed.
Testing the advisor on the hourly timeframe over the past fifteen years took 20 minutes in the fourth version of the terminal, about 5 minutes in the fifth version, and 13 seconds in the R language tester. Optimization of fifty thousand combinations of parameters of the same advisor under the same conditions in the fourth version took 10 days, in the fifth - only three days, in R - 16 minutes. I couldn’t do the same test on 15 pairs simultaneously in MT4 - it’s simply not provided for. The fifth version completed it in 2 hours 27 minutes, R in 6 minutes. The R language is far from the fastest; if I had written a tester in C#, for example, I’m sure the results would have been 10 times better.
Let's analyze MetaTrader 4

What makes the tester so slow? One thing I can say for sure is that the terminal uses only one processor core. So even if your machine has four or more cores, that will not help much. The problem can be mitigated somewhat by launching another terminal and testing or optimizing the advisor in parallel on another instrument. That will speed things up a little, but it will not solve the problem completely - a single test will still take painfully long.
Another serious problem with the fourth terminal is the inability to use tick data. When we trade in real time, prices arrive at the terminal at different moments as they change. The arrival of a price is not tied to a specific interval such as once per minute or once per hour. Historical data in the terminal's quote database, however, is stored as OHLC prices for various timeframes, for example H1, D1, M15. During testing, the arrival of new ticks is simulated on the basis of the M1 timeframe. We all know that the volumes sent by the broker are not real market volumes at all; they are simply the number of ticks that occurred during a given period. In other words, the tester knows the open and close prices of the M1 candle, the high and low over that minute, and the number of ticks received during that minute. What the prices of those ticks actually were and exactly when each tick arrived is unknown. The tester reconstructs the price movement inside the candle using its own algorithms. So here again we have the same kind of simplified model discussed above.
Even so, traders found a workaround. There are several special programs that allow you to use real tick data for testing and optimization. These include the free program TickstoryLite and the paid program Tick Data Suite. You can read about the features of each program in the articles devoted to them, but I would recommend not being stingy and buying the latter. Tick Data Suite not only lets you test advisors on tick data, it also adds a couple of very useful features to the terminal: a floating spread (that is, essentially two tick streams, as in real life, separate Ask and Bid) and slippage emulation, an excellent stress test that is indispensable for scalping advisors.
Now I need to say a few words about modeling quality, which in the standard terminal does not rise above 90%, but becomes 99% with the help of the programs mentioned above. Many of you have probably often seen the claim online that 90% tests are complete rubbish and that only 99% tests are fully accurate. Of course, that is not entirely true (to be honest, any test performed on the MT4 or MT5 platform, while not rubbish, is still quite far from reality). So what is modeling quality? It is simply a number calculated using a formula invented by MetaQuotes:
Modeling quality = ((0.25 * (StartGen - StartBar) + 0.5 * (StartGenM1 - StartGen) + 0.9 * (HistoryTotal - StartGenM1)) / (HistoryTotal - StartBar)) * 100%, where:
HistoryTotal - the number of bars in history;
StartBar - the number of the bar from which the simulation began;
StartGen - the number of the bar from which testing began based on historical data of the nearest timeframe;
StartGenM1 - the number of the bar from which the minute-based simulation began.
As you can see, modeling quality simply tells us which timeframes were used in the test. If only M1 data was used when testing an advisor on the M15 period, then the modeling quality will be 90%. If only the M15 period itself was used, then according to the formula the quality will be zero or, as MT4 reports it, n/a. And when testing on ticks, they simply decided to write 99% so it would be clear that everything was working properly and the test was really using ticks. Does that mean that a test with n/a quality gives completely inaccurate data, while a test with 99% quality gives very accurate data? No. It only means that in the first case the close-price model was used, and in the second case ticks were used. At the same time, the first test may easily be much more accurate than the second, and this is important to understand: modeling quality and the quality of the quotes themselves are completely different things.
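The formula is easy to check by hand or in code. The sketch below implements it exactly as quoted; the function name and the bar numbers in the examples are hypothetical.

```python
def modeling_quality(history_total, start_bar, start_gen, start_gen_m1):
    """Modeling quality in percent, following the formula quoted above."""
    weighted = (
        0.25 * (start_gen - start_bar)
        + 0.5 * (start_gen_m1 - start_gen)
        + 0.9 * (history_total - start_gen_m1)
    )
    return 100.0 * weighted / (history_total - start_bar)


# Hypothetical bar numbers: M1 data covers the whole tested range.
print(modeling_quality(history_total=1100, start_bar=100, start_gen=100, start_gen_m1=100))

# M1 data available only for the second half of the range.
print(modeling_quality(history_total=1100, start_bar=100, start_gen=100, start_gen_m1=600))
```

When M1 data covers the entire range, only the 0.9 term remains, which is where the 90% ceiling comes from. The number says nothing about how good the quotes themselves are.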
It is also important to understand that the quality of the test results depends on the strategy itself. For example, if a strategy works on D1 using closing prices, without stops, take-profits, or trailing stops, and enters only at market prices without pending orders, then you can get a completely accurate backtest even with the “checkpoints” model. For testing strategies that use stops and take-profits, at least 90% quality is required to get a result that is at least somewhat close to reality. Moreover, the lower the timeframe on which the advisor works, the greater the effect of the spread on the results. With a fixed spread value, tests of such advisors can be extremely far from reality, to the point where an advisor that is profitable in testing will blow the account in live trading. As you can see, the actual quality of a test depends on many factors, such as quote quality, the characteristics of the system itself, and the system's ability to bypass or neutralize the limitations of the MT4 tester. This once again proves my point that if you want to trade profitably with an advisor, you must know and deeply understand exactly how it works and what its algorithm is so that you can reduce the terminal's limitations to zero.
Another serious problem with tick data is broker dependence. Test results on tick data from one broker will differ from results on another broker's data, because even the closing prices of different brokers can differ by 10-15% of the candle size. In the long run, such a difference can produce very different equity curves for the same advisor running on accounts with different brokers.
Now let's talk about another terminal limitation, this time one that has already been fixed. Originally, it was impossible to change the spread used for testing in the MT4 terminal. As a result, the tests came out differently every time. Moreover, if a trader ran a test on a weekday and got a good result from the advisor, then reran the same test on a weekend or at night, it was entirely possible to discover that the advisor was losing money. This caused a lot of confusion - how could that happen if it had just worked? And even if you knew about this issue, it was still inconvenient: to test a night scalper, the trader had to look out the window and wait for the moon. In general, it was quite a circus. Fortunately, this disgrace was later corrected, giving traders the ability to set the spread required for testing.
New MetaTrader 5 terminal

So, the mql5 language has quite a lot of capabilities. That is undoubtedly a major plus. The downside is that advisors written in mql4 will not work in the MT5 terminal. As a result, traders who want to switch to the new terminal, are not familiar with programming, and still want to trade with advisors that worked successfully on the previous terminal version are forced to pay for a port. Guess where people are most likely to look for such programmers? That's right: in the “Freelance” section on the website of the same MetaQuotes company, which earns a commission from every order, every advisor and indicator sold, and much more. Business is business.
A big advantage of the terminal compared with the previous version is its use of multiple cores. This is done using the Test Agent Manager. In this case, you can use any number of cores. You can also build an entire network of your own computers and use their combined power to test and optimize advisors on the “main” machine, or even use a cloud network where other traders rent out their computing power for a reasonable fee. It is an exciting idea; in practice, the performance increase I saw was about threefold. At the same time, much depends on internet connection speed and other factors. I also tried building my own network, but the computers kept disconnecting, and it was more annoying than helpful.
And one more observation regarding MT5. Tests on the fifth platform are very similar to tests on the fourth, but they always turn out significantly better than on MT4. Most likely, this is due to slightly different quotes.
Common problems

The old mql4 was a very simple language that could be mastered in at most a week. Over the past year, many changes have been added to mql4 that have brought it much closer to the fifth version, in particular support for many object-oriented programming features such as classes, encapsulation, inheritance, polymorphism, overloading, and abstract classes. And yet, both mql4 and mql5 are still less capable than an independent programming language.
What is actually happening inside the tester during testing and optimization that slows the whole process down so much? I do not know. Nobody knows except the programmers who wrote the terminal, because the code is closed and there is no way to see how everything is really calculated there. By contrast, with a self-written tester I know exactly what is happening, where each number comes from, and how it is calculated. Most importantly, I know how reliable my backtest is.
Another drawback that follows from the previous one is the closed optimization algorithm. There are many different ways to optimize an advisor's parameters. There are also many different metrics you might want to optimize for. You can create your own parameter and display it in the “optimization results” table by adding several lines of code to the advisor itself. But you still will not be able to optimize for it.
The saddest drawback of the MT4 and MT5 terminals is that it is not possible to conduct tests on several instruments in order to obtain a summary report on the trading portfolio of advisors. Therefore, you have to use third-party software, such as Report Manager or SQ EA Analyzer. Even though the first program is free, I recommend purchasing the second option due to its much more extensive and much needed features.
Both terminals use only one quote stream - Bid. Ask quotes are then calculated from the specified spread (Bid + Spread). As a result, the Ask chart completely coincides with the Bid chart, which of course is far from true in reality. The spread is constantly changing, and if you look at the tick chart in the terminal, you will see that the Bid and Ask charts do not always look the same. This is another simplification. What does it lead to? For example, for advisors trading on H1 and higher, a couple of points of error over roughly 15 years may produce a final profit inaccuracy of around plus or minus 3-10% - tolerable. But for scalpers, that inaccuracy can lead to an error of up to 50%. This situation is also dangerous for advisors that work with pending orders. A fixed spread means that the test results of such advisors can be very far from reality, because in the test, for example, a losing trade may be missed, while in real life it would have occurred because the actual spread at that moment was 3 points rather than the 2 points used in the test. Or conversely, many profitable trades seen in the tester may never have happened in reality.
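The effect of a fixed spread on a stop order can be shown with a toy example. The sketch assumes a short position whose stop loss triggers on the Ask price; the function name and all prices (in integer points) are made up.

```python
def stop_hit(bids, spreads, stop_level):
    """A short position's stop loss triggers when Ask (Bid + spread) reaches it."""
    return any(bid + spread >= stop_level for bid, spread in zip(bids, spreads))


# Prices in integer points to avoid floating-point noise (made-up numbers).
bids = [10990, 10995, 10997, 10993]
stop_level = 11000

fixed_spread = [2] * len(bids)          # what the tester assumes
real_spread = [2, 2, 3, 2]              # spread widens by one point on the third tick

print(stop_hit(bids, fixed_spread, stop_level))   # tester: stop not hit
print(stop_hit(bids, real_spread, stop_level))    # live: 10997 + 3 reaches the stop
```

A single point of extra spread on a single tick is enough to turn a trade that survived in the tester into a stopped-out loss on a live account.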
Another drawback, not too serious but still affecting test accuracy, is the calculation of pip value for crosses, which uses the following formula: Point price = position volume * point size * current quote of the base currency against USD / current rate of the currency pair (cross rate). The quote of the base currency and the pair's rate are taken by the tester from today and that value is used throughout the test. But 20 or even 5 years ago, both prices were different, and accordingly the pip value was different as well. This is not such a terrible flaw - if the bot blows the account, it will do so with any pip value. Here we are only talking about the fact that the final test results are inaccurate.
Roughly the same thing happens if your account is denominated not in USD but, for example, in EUR. Suppose you are testing the advisor on USDCHF: the tester will take the current EURUSD and EURCHF rates to calculate the pip value. Naturally, such a test will also contain an error, since these quotes change over time as well.
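To get a feeling for the size of this error, the cross-rate formula above can be evaluated with two sets of rates. The sketch uses EURGBP with a USD account; the function name and all rates are made up for illustration and are not historical quotes.

```python
def point_value_usd(volume, point_size, base_usd_rate, pair_rate):
    """Point price = volume * point size * base/USD quote / cross rate."""
    return volume * point_size * base_usd_rate / pair_rate


# 1 lot (100,000 units) of EURGBP, point size 0.0001; all rates are made up.
today = point_value_usd(100_000, 0.0001, base_usd_rate=1.10, pair_rate=0.85)
years_ago = point_value_usd(100_000, 0.0001, base_usd_rate=1.40, pair_rate=0.70)

print(f"point value with today's rates: {today:.2f} USD")
print(f"point value with older rates:   {years_ago:.2f} USD")
print(f"difference: {100 * (years_ago - today) / today:.0f}%")
```

The tester applies the first number to the whole history, so every trade from the period with different rates is valued incorrectly.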
Just above I mentioned the EA Analyzer program. It provides a number of different stress tests designed to help you better assess the consequences of using your advisors on a real account. For example, a Monte Carlo simulation will allow you, with a certain degree of probability, to estimate the worst-case scenario for the advisor and, if it suits you, start working. In professional terminals, this stress test is built in, as well as many others, for example, tests for robustness of advisor settings, broker dependence or spread dependence. In the previous article, I said that robots do not predict the future; we can only, based on tests, obtain a certain probability that the advisor will bring us profit. Stress tests are very important tools for work, because the more thoroughly the robot is tested, the more likely it is that it will work the same way it did on historical data.
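One common variant of a Monte Carlo stress test shuffles the order of historical trades many times and looks at the distribution of drawdowns. The sketch below shows that variant only; it is not a description of how EA Analyzer or any other product implements it, and the trade results are made up.

```python
import random


def max_drawdown(trades):
    """Largest peak-to-trough drop of the cumulative profit curve."""
    equity = peak = worst = 0.0
    for pnl in trades:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def monte_carlo_drawdown(trades, runs=1000, percentile=0.95, seed=42):
    """Shuffle the trade order many times; return the drawdown at the given percentile."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        shuffled = trades[:]
        rng.shuffle(shuffled)
        results.append(max_drawdown(shuffled))
    results.sort()
    index = min(int(percentile * runs), runs - 1)
    return results[index]


# Made-up trade results in account currency.
trades = [50, -30, 20, -40, 70, -20, 10, -60, 80, -10]
print("historical drawdown:", max_drawdown(trades))
print("95th percentile drawdown after shuffling:", monte_carlo_drawdown(trades))
```

If the drawdown at the chosen percentile is more than you are prepared to tolerate, the advisor is not ready for a real account, no matter how smooth the historical curve looks.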
Another factor that affects test accuracy is the swap value. The MT4 and MT5 testers use the current swap value for the entire test, even though in real life the swap definitely changes, sometimes several times a year. Funnily enough, different brokers also have different swap values for the same instrument. For example, at the broker RoboForex, the current swap on USDCHF is -2.5 points for shorts and -1.4 points for longs, while at Forex4u these figures are -5.5 and +2.4 points. If you are testing an overnight scalper, the swap can play a decisive role both in the profitability of the backtest and in the profitability of live trading. At the same time, you cannot realistically assess the impact of swap size on trading: this value cannot be set, let alone changed during testing. It is taken directly from the data of the broker whose account is currently active in the terminal.
As you can see, a huge number of small things that could be solved with just a couple of lines of code (well, maybe not a couple, but still) remain unimplemented. Taken together, these little things significantly complicate working with the terminal and, in the case of MT5, can make it impossible. The fact is that the end users of the terminal - you and me - are not MetaQuotes' clients. Their clients are primarily brokers, and they largely do not care about the needs of ordinary traders.
And finally, let's discuss the use of time in the MT4 platform. The mql language has many functions that use time in relation to the prices of a specific candle, for example Open or High. In most cases, when writing trading algorithms, we use this data in one way or another. This data, however, is not tied directly to the broker's clock; it is tied to the arrival of new ticks. Let's say we use the H1 period. When a new hour begins, if no ticks have arrived at that moment, the new candle appears only as a dash until new ticks start coming in. In that case, the hour has already changed, but until new ticks arrive the bars have not shifted: what the advisor reads as the previous bar's Close is in fact the Close of the bar before it. In practice, this means that if the advisor uses time in the form of Hour() and gets candle prices through Close, Open, High, and Low, it will receive a signal at the desired hour, but the OHLC prices will not yet have been updated because no new ticks have arrived. In other words, the prices of bar No. 1 will actually still be the prices of the second bar. Therefore, in this case, the algorithm needs to wait for the first ticks of the new candle to appear. Note that this applies only to advisors that use specific entry times together with candle-based price calculations. If candles are not used for entry, this error will not occur.
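The idea of waiting for the first tick of the new candle can be sketched as follows. This is a Python stand-in for the logic, not MQL code; the class and function names are hypothetical, and tick times are given in seconds since midnight.

```python
def bar_open_time(timestamp, period_seconds=3600):
    """Open time of the bar that contains the given timestamp (in seconds)."""
    return timestamp - timestamp % period_seconds


class NewBarDetector:
    """Reports a new bar only when the first tick of that bar actually arrives."""

    def __init__(self, period_seconds=3600):
        self.period_seconds = period_seconds
        self.last_bar = None

    def on_tick(self, tick_time):
        current_bar = bar_open_time(tick_time, self.period_seconds)
        is_new = self.last_bar is not None and current_bar != self.last_bar
        self.last_bar = current_bar
        return is_new


detector = NewBarDetector()
# Ticks at 09:59:50, 09:59:58, then nothing until 10:00:07 (seconds since midnight).
for t in (35990, 35998, 36007, 36010):
    print(t, detector.on_tick(t))
```

The signal is tied to the tick that opens the 10:00 candle rather than to the clock, so by the time the advisor reads the previous bar's prices, they really belong to the bar that has just closed.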
So what should you choose – dependent or independent platform?

For traders interested in developing scalping trading systems that work on timeframes up to M5, MetaTrader is definitely not suitable. Even with 99% modeling quality, the results will still differ greatly from reality (although you can achieve at least some rough resemblance). In fact, it is quite easy to create a grail in the MT4 terminal, but in real trading it most likely will not work. Any market aspect ignored during the test can lead to results that are radically different from the test.
I do not want to say that it is impossible to create a scalper in MT4 that is profitable over the long term. It is just much harder than creating, say, a profitable long-term system. Therefore, if you are still not very familiar with algorithmic trading, I recommend starting by creating advisors for timeframes no lower than H1, and only then trying to build a portfolio of profitable scalpers.
It is important to understand that commercial platforms such as MetaTrader, MetaStock, TradeStation, NinjaTrader, and others are by no means professional platforms. Each of them has its own limitations, and closed source code means that many of those limitations are neither easy nor even possible to discover. In general, the first idea that comes to mind for a trader who wants to use advisors is to use a commercial platform, for example one of those listed above. It already includes everything - testing, optimization, and trading itself. Everything is convenient and simple, all in one package. But such a decision makes the trader dependent on the chosen platform and its developers: on what functionality the authors decide to add and what they do not, as well as on the errors and inaccuracies they made during development, many of which cannot be fixed. Commercial platforms do provide a trader with a certain basic level of functionality. But if you want to engage in high-frequency trading, for example, such platforms are no longer suitable.
An independent solution is one developed by you personally. All the functionality you need, testing accuracy, the number of optimization options, various stress tests, execution speed, and other perks depend entirely on you. You can choose for yourself which liquidity provider API to use while remaining independent from the broker and its manipulations with the price stream. Such a solution has virtually unlimited possibilities and is constrained only by your imagination and knowledge. It is exactly this kind of independent solution that professional algorithmic traders choose.
After all, when you connect directly to a liquidity provider, you immediately free yourself from dependence on a slow, inefficient, imperfect platform, from brokers who manipulate prices and the execution of your orders, and from the limitations of various strategy types that are simply impossible to implement through commercial terminals. However, this approach is far from suitable for everyone, because to build a good platform for trading you need an enormous amount of knowledge, serious programming experience, and a great deal of time. As a rule, an entire team of programmers is hired to develop platforms of this type, and they implement the plan in a fairly short time. All of this requires either a lot of time and knowledge or a lot of money.
Conclusion

But until you have at least ten million dollars in your account (and once you reach that threshold, you will not find a single trader using a commercial product for trading), there is nothing wrong with using a ready-made platform such as MetaTrader 4. The important thing is to always remember its limitations, take them into account in your work, and remain attentive and critical of the results you get from the tester. Trade within the limits of what your terminal allows, try to create advisors that are “immune” to its shortcomings, use third-party programs such as Tick Data Suite and SQ EA Analyzer to minimize the terminal's weaknesses, and then, even when using such an imperfect product as MT4, you will still be able to make your profit from the market.
Best regards, Dmitry aka Silentspec
TradeLikeaPro.ru