The Factorio Benchmark Website

test-000001 : What is the best way to benchmark Factorio?

Factorio Version 0.16.51

The TLDR

Future tests will use the built-in benchmark utility, accessed by passing the --benchmark parameter. The Linux headless server will be used to allow automation and reduce wasted time.

The Question

There are a number of possible ways to benchmark Factorio to retrieve performance data.

show-fps

This method shows two basic figures: FPS and UPS. FPS stands for frames per second, essentially how many frames are displayed every second. UPS stands for updates per second, which is how many times per second the game state (trains, inserters, assembling machines, furnaces, and more) is updated. Both numbers are soft-capped at 60, so an FPS/UPS of 60/60 means that the game is running at full speed. The UPS limit can be raised by running the game faster than normal speed: /c game.speed = 10 raises the limit to 600 UPS, and so on. Note that the FPS remains at 60 (or your monitor's refresh rate) if the vsync option is enabled.
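The relationship between game.speed and the UPS cap is a simple multiplication. As a minimal sketch (the names BASE_UPS and ups_cap are ours for illustration, not part of Factorio):

```python
BASE_UPS = 60  # soft cap on updates per second at normal speed (game.speed = 1)

def ups_cap(game_speed: float) -> float:
    """Return the UPS ceiling implied by a given game.speed value."""
    return BASE_UPS * game_speed

for speed in (1, 2, 10):
    print(f"game.speed = {speed}: cap of {ups_cap(speed):g} UPS")
```

A map that cannot reach the raised cap will simply report whatever UPS the machine can sustain, which is what makes raising the speed useful for measurement.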

The problem with this method is that in normal Factorio gameplay, the amount of work done in each tick (update) can vary between ticks and can have upward or downward trends over thousands or millions of ticks. Thus recording the UPS in any one tick may not be representative of the average UPS over a greater length of time.

Another weakness shows up when comparing different systems. If one system has a much more powerful GPU, its share of render time will be smaller, which could lead to incorrect comparisons.

show-time-usage

This method of measuring performance is effectively a step up from show-fps. Performance here is broken down into several categories, and the units change from updates per second to milliseconds per update. That means our typical 60 UPS is now equal to 16.67 ms per tick.
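The unit change is just a reciprocal scaled to milliseconds. A small sketch of the conversion in both directions (the function names are our own):

```python
def ups_to_ms(ups: float) -> float:
    """Milliseconds available per update at a given updates-per-second rate."""
    return 1000.0 / ups

def ms_to_ups(ms_per_update: float) -> float:
    """Updates per second achievable if every update takes ms_per_update."""
    return 1000.0 / ms_per_update

print(f"60 UPS leaves {ups_to_ms(60):.2f} ms per tick")
print(f"a 20 ms update allows {ms_to_ups(20):.0f} UPS")
```

Any update that takes longer than the per-tick budget for 60 UPS means the game is running below full speed.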

This breakdown of the game's update is quite useful for us. We can now factor out the GPU's effect[1] on the performance by ignoring the render time and instead looking at the Update: and Game Update: sections. The Update: section gives us a good indicator of the time a design consumes but it is still subject to the same faults that show-fps has. If one design is significantly better than another (+50%), we can know almost immediately. This measurement is also not dependent on the game's speed, so no commands need to be run here.

The Game Update: section gives us a top level breakdown of where the time is being spent.

All of these attributes are useful when debugging a design, and they give a gauge of where to spend time optimizing. As presented here, though, they are of limited use for benchmarking, because the data is constantly changing and can't easily be collected.

Also of note is the Script update: section. This section gives us a per mod breakdown of how much time each mod takes to process. Particularly offending mods can be noted and removed if required.

Using a Profiler

If you thought the information in show-time-usage was too high level, then this section is for you.

A profiler essentially looks into the game while it is running and reports back the number of calls to a function as well as time spent in those functions. Because the developers of Factorio include debugging symbols in the game, these function names are useful to us. Some good profilers to use are Visual Studio or VerySleepyCS, or for Linux, Callgrind.

Instead of a single category like entity update in show-time-usage, each individual entity type's functions are available to us. We can see exactly what share of time is used by inserters or labs, and within these entities we can see which functions are time consuming. For example, inserters have functions for getting a pickup target and for dropping items.

However, the pitfall associated with this method is the larger amount of time required to gather and process results. It is excellent as a tool to deep dive into the finest details of a particular design, but becomes tiresome to attempt on more than a handful of designs.

--benchmark

That brings us to our final candidate for collecting and processing performance data, the --benchmark parameter. This parameter runs the game without rendering anything, as fast as it possibly can. With this parameter and related parameters, we can supply the exact number of ticks to run, as well as specific maps that we intend to benchmark. All of the data can be collected automatically for easy storage, retrieval, and processing.

Related parameters include --benchmark-ticks, which sets the number of ticks to benchmark, and --disable-audio, which saves time by not loading the sound files. On Windows and Linux the game still loads all of its textures, which takes significant time. However, we can skip this step by using the headless version of Factorio, which is available for Linux and intended for servers (--disable-audio is not needed for headless). As an example, this command will benchmark the map foo.zip for 1000 ticks on Linux.

./factorio --benchmark "foo.zip" --benchmark-ticks 1000 --disable-audio
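For automation, the same command can be assembled from a script. The following is only a sketch: the helper name benchmark_command and its arguments are hypothetical, it uses only the parameters named above, and it is not the helper script referenced later in this article.

```python
import shlex

def benchmark_command(map_path, ticks, binary="./factorio", headless=False):
    """Build the argument list for one benchmark run.

    Assumption: `binary` points at a Factorio executable. On the headless
    build, --disable-audio is left out because it is not needed there.
    """
    cmd = [binary, "--benchmark", map_path, "--benchmark-ticks", str(ticks)]
    if not headless:
        cmd.append("--disable-audio")
    return cmd

cmd = benchmark_command("foo.zip", 1000)
print(shlex.join(cmd))

# To actually launch the run (requires Factorio at the given path):
# import subprocess
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# print(result.stdout)
```

Building the command as a list rather than a single string avoids quoting problems with map names that contain spaces.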

This approach is not entirely free, because we lose the ability to test render-related attributes[1]. Normally that is a good thing, though it does rule out some tests.

The Test

Now that we've outlined most of the possible ways to collect data, we need to set up our test. We won't test the show-time-usage or the profiler methods because they can be ruled out by our prior reasoning.

Since we are comparing data collection methods, we need 1 map where we can gather our data. After a quick browsing of the forums, this seems like a good choice.

For our test, we want to be able to take readings at the same points consistently. First we should teleport to something with a good visual indicator. Rocket launches fit the bill. /c game.player.teleport({-12848, -875}) gets us there. We set our game zoom to 1.000 by hitting the F9 key. The bottom rocket silo finishes readying for launch right around game tick 57092400 (the /c game.player.print(game.tick) command prints the current game tick). We save the game around this point, since this is where we will begin collecting our data.

For show-fps, we will take our first sample right as the bottom most silo's rocket reaches the edge of the screen. We then wait until around game tick 57098000 when the bottom silo finishes readying again. As the rocket again reaches the edge of the screen we record the data. Finally, after waiting a while, the bottom silo is ready again at tick 57114100. We take our final reading for the run as the rocket crosses the screen border. The duration of this exchange is roughly 23,000 ticks. Thus we can run these same 23,000 ticks when we do our testing of the other methods of data collection.

We repeat this procedure three times for each method, record the results, and then compare each result against the average of those results.

The Data

First up we have the show-fps method:

show-fps (UPS)     Run 1    Run 2    Run 3
Sample point 1     34.7     34.6     34.7
Sample point 2     34.7     34.2     34.4
Sample point 3     36       36       35.9
Average per run    35.13    34.93    35.00
Average overall    35.02

This yields an average UPS of 35.02.
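The averages can be reproduced from the nine recorded samples. A short sketch using Python's statistics module (the variable names are our own):

```python
from statistics import mean

# UPS readings per run, in sample-point order (1, 2, 3)
samples = {
    "Run 1": [34.7, 34.7, 36.0],
    "Run 2": [34.6, 34.2, 36.0],
    "Run 3": [34.7, 34.4, 35.9],
}

run_means = {run: mean(values) for run, values in samples.items()}
overall = mean(list(run_means.values()))

for run, value in run_means.items():
    print(f"{run}: {value:.2f} UPS")
print(f"Overall: {overall:.2f} UPS")
```

Because every run has the same number of samples, the mean of the run averages equals the mean of all nine samples.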

Next we have the --benchmark method. I wrote a helper script to automatically run and record the values reported by the benchmark. It is available here: https://github.com/mulark/factorio_benchmark_scripts

map_name        run_index  avg_ms  min_ms  max_ms  ticks  execution_time (ms)  Converted UPS
FuzzyPants.zip  1          27.06   24.674  56.338  23000  622371.106           36.955
FuzzyPants.zip  2          27.095  24.685  53.648  23000  623176.686           36.907
FuzzyPants.zip  3          27.073  24.654  53.965  23000  622676.959           36.937

We can convert our avg_ms values to UPS values by taking 1000 / avg_ms. These converted UPS numbers do not include the render overhead, so they cannot be directly compared to the show-fps numbers.
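As a sketch of that conversion, the snippet below recomputes the UPS for each run from its avg_ms value and, as a sanity check, recomputes avg_ms from execution_time divided by ticks. The tuple layout and the function name converted_ups are our own choices.

```python
# Values copied from the --benchmark results:
# (run_index, avg_ms, ticks, execution_time_ms)
runs = [
    (1, 27.06, 23000, 622371.106),
    (2, 27.095, 23000, 623176.686),
    (3, 27.073, 23000, 622676.959),
]

def converted_ups(avg_ms: float) -> float:
    """Updates per second implied by an average update time in milliseconds."""
    return 1000.0 / avg_ms

for run_index, avg_ms, ticks, execution_time_ms in runs:
    # Total run time divided by the tick count should reproduce avg_ms.
    recomputed_ms = execution_time_ms / ticks
    print(f"run {run_index}: {converted_ups(avg_ms):.3f} UPS "
          f"(avg_ms {avg_ms}, recomputed {recomputed_ms:.3f} ms)")
```

A lower avg_ms always maps to a higher converted UPS, so the run with the smallest average update time should show the largest UPS figure.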

These values indicate that a single show-fps sample can be as much as 2.8% away from the overall average UPS (36 against 35.02). The average distance from the average was about 1.8%. Meanwhile, when benchmarking over 23,000 ticks, each run was less than 0.1% away from the overall average. Thus it can be reasonably concluded that the --benchmark method is superior.
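The spread of each method can be checked with a few lines of arithmetic. This sketch measures every deviation as a percentage of the overall average; dividing by the individual sample instead gives a slightly smaller figure for the largest show-fps deviation. The helper name percent_deviations is our own.

```python
from statistics import mean

# All nine show-fps samples (UPS) and the three --benchmark avg_ms values
fps_samples = [34.7, 34.6, 34.7, 34.7, 34.2, 34.4, 36.0, 36.0, 35.9]
bench_avg_ms = [27.06, 27.095, 27.073]

def percent_deviations(values):
    """Absolute distance of each value from the mean, as a percent of the mean."""
    centre = mean(values)
    return [abs(v - centre) / centre * 100 for v in values]

fps_dev = percent_deviations(fps_samples)
bench_dev = percent_deviations(bench_avg_ms)

print(f"show-fps:    max {max(fps_dev):.2f}%, mean {mean(fps_dev):.2f}%")
print(f"--benchmark: max {max(bench_dev):.2f}%, mean {mean(bench_dev):.2f}%")
```

The comparison is between single show-fps readings and whole 23,000-tick benchmark runs, which is exactly the practical choice being made: one glance at the screen versus one automated run.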

Closing

In closing, all future tests will use the --benchmark method, as it has been shown to give more consistent results than show-fps. If rendering requires testing, it's likely that the show-fps method will be used. There are still other elements which warrant testing, like looking for the optimal number of ticks to use and the optimal number of runs. However, those tests will be conducted at a later time and date.