The Factorio Benchmark Website

test-000001 : What is the best way to benchmark Factorio?

Factorio Version 0.16.51

The TLDR

Future tests will use the inbuilt utility accessed by passing the --benchmark parameter. The Linux headless server will be used to provide automation and reduce wasted time.

The Question

There are a number of possible ways to benchmark Factorio to retrieve performance data.

Performance could be gathered by pressing F4 to view advanced debugging options and enabling the show-fps parameter.
Performance could be measured by recording the data in show-time-usage.
Performance could be gleaned by running a profiler like the one in Visual Studio, VerySleepyCS, or Callgrind.
Performance data can be collected by running a benchmark, done by passing the --benchmark flag to the Factorio executable.

show-fps

This method shows us two basic numbers: FPS and UPS. FPS stands for frames per second, essentially how many frames are displayed every second. UPS stands for updates per second. This number is how many updates to each entity (trains, inserters, assembling machines, furnaces, and more) are performed every second. These two numbers are soft-capped at 60, thus an FPS/UPS of 60/60 means that the game is running at full speed. The limit on the number of updates per second can be raised by running the game faster than normal speed: /c game.speed = 10 will raise the limit to 600 UPS, and so on. Note that the FPS remains at 60 (or your monitor's refresh rate) if the vsync option is enabled.

The problem with this method is that in normal Factorio gameplay, the amount of work done in each tick (update) can vary between ticks and can have upward or downward trends over thousands or millions of ticks. Thus recording the UPS in any one tick may not be representative of the average UPS over a greater length of time.

Another weakness shows up when comparing different systems. If one system has a much more powerful GPU, its share of render time will be smaller, which could cause incorrect comparisons to be made.

show-time-usage

This method of measuring performance is effectively a step up from show-fps. Performance here is broken down into several categories, though the units have changed from updates per second to milliseconds per update. That means our typical 60 UPS is now equal to 16.67 ms per tick.
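The conversion between the two units is simple arithmetic. A minimal sketch in Python follows; the function names are illustrative and not part of Factorio.

```python
def ups_to_ms_per_tick(ups: float) -> float:
    """Time available for one update, in milliseconds, at a given UPS."""
    return 1000.0 / ups


def ms_per_tick_to_ups(ms: float) -> float:
    """UPS the game could sustain if every update took `ms` milliseconds."""
    return 1000.0 / ms


if __name__ == "__main__":
    # The 60 UPS soft cap expressed as a per-tick time budget.
    print(f"{ups_to_ms_per_tick(60):.2f} ms per tick at 60 UPS")
    # A map that needs 25 ms per update cannot reach the 60 UPS cap.
    print(f"{ms_per_tick_to_ups(25.0):.1f} UPS at 25 ms per tick")
```

Any update time above the 60 UPS budget means the game is running slower than full speed.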

This breakdown of the game's update is quite useful for us. We can now factor out the GPU's effect^[1] on the performance by ignoring the render time and instead looking at the Update: and Game Update: sections. The Update: section gives us a good indicator of the time a design consumes but it is still subject to the same faults that show-fps has. If one design is significantly better than another (+50%), we can know almost immediately. This measurement is also not dependent on the game's speed, so no commands need to be run here.

The Game Update: section gives us a top level breakdown of where the time is being spent.

Circuit networks are pretty self-explanatory. Any entity checking the circuit network adds to this time.
Transport lines are effectively the two lanes on belts where items can travel. Splitters, belts, and underground belts add to this overhead.
Entity update is nearly every entity in the game, meaning bots, inserters, assembling machines and so on. If it doesn't fall under another category it ends up here.
Map generator takes time every time a new chunk is generated. The player character and scanning radars typically cause this.
CRC is a check done to ensure integrity of the game in multiplayer.
Electric network primarily consists of getting and using the power. Every entity that uses power adds a small amount of overhead to this, which does add up. Power generation is also included here but it is effectively free if using solar panels.
Logistic manager is how robots are controlled. This consumes time when it looks for requester chests that need items, provider chests to provide the items, and controls the logic behind robots going to charge.
Construction manager is the time consumed by every ghost entity in the world, as well as any damaged entities. It does not matter if these ghosts are in range of any construction radius, they will still consume time.
Path finder is used to create paths for biters. Ideally the game is played without these menaces, so this does not factor in.
Trains are self-evident. Updating their speeds, fuel levels, and station conditions all play a role.
Train path finder is the path finder for trains. Need I say more?
Commander controls biters, handling their movement and attack groups, and ordering them to expand. In the ideal game this never does anything.
I'm not sure how Chart Refresh differs from Chart update further below, but these control how the map is revealed, as well as things like map pins and trains moving on the map.

All these attributes are useful in debugging a design, and they can give a gauge of where to spend time optimizing. But as they exist here they aren't all that useful for benchmarking, as this data is constantly changing and can't be easily collected.

Also of note is the Script update: section. This section gives us a per mod breakdown of how much time each mod takes to process. Particularly offending mods can be noted and removed if required.

Using a Profiler

If you thought the information in show-time-usage was too high level, then this section is for you.

A profiler essentially looks into the game while it is running and reports back the number of calls to a function as well as time spent in those functions. Because the developers of Factorio include debugging symbols in the game, these function names are useful to us. Some good profilers to use are Visual Studio or VerySleepyCS, or for Linux, Callgrind.

Instead of a category like entity update in our show-time-usage section, each individual entity type's functions are available to us. We can see exactly what ratio of time is used by inserters or labs. And within these entities we can see which functions are time consuming. For example, inserters have functions to get a pickup target and to drop items.

However, the pitfall associated with this method is the larger amount of time required to gather and process results. It is excellent as a tool to deep dive into the finest details of a particular design, but becomes tiresome to attempt on more than a handful of designs.

--benchmark

That brings us to our final candidate for collecting and processing performance data, the --benchmark parameter. This parameter runs the game without rendering anything, as fast as it possibly can. With this parameter and related parameters, we can supply the exact number of ticks to run, as well as specific maps that we intend to benchmark. All of the data can be collected automatically for easy storage, retrieval, and processing.

Related parameters include --benchmark-ticks, which sets the number of ticks to benchmark for, and --disable-audio, which saves us time by not loading the sound files. On Windows and Linux, the game still loads all the textures, which takes significant time. However, we can use the headless version of Factorio, available for Linux and intended for servers, to skip this step (--disable-audio is not needed for headless). As an example, this command will benchmark the map foo.zip for 1000 ticks on Linux.

./factorio --benchmark "foo.zip" --benchmark-ticks 1000 --disable-audio
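For automation, the same command can be assembled and launched from a script. The following is only a sketch in Python, not the helper script used for this test. The function names are hypothetical, and only the three flags described above are used. Parsing the results is deliberately left out, because the output format is not described here.

```python
import subprocess
from typing import List


def build_benchmark_command(executable: str, map_path: str, ticks: int,
                            headless: bool = False) -> List[str]:
    """Assemble the argument list for one benchmark run."""
    command = [executable, "--benchmark", map_path,
               "--benchmark-ticks", str(ticks)]
    if not headless:
        # The headless build does not need this flag (see the text above).
        command.append("--disable-audio")
    return command


def run_benchmark(command: List[str]) -> str:
    """Run Factorio and return its raw standard output, unparsed."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Prints the command line from the example above without running it.
    print(" ".join(build_benchmark_command("./factorio", "foo.zip", 1000)))
```

Passing the arguments as a list avoids shell quoting problems with map names that contain spaces.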

It's not entirely free, because we lose the ability to test render related attributes^[1]. Normally this is a good thing, though there are test possibilities we lose by using this.

The Test

Now that we've outlined most of the possible ways to collect data, we need to set up our test. We won't test the show-time-usage or the profiler methods because they can be ruled out by our prior reasoning.

Since we are comparing data collection methods, we need 1 map where we can gather our data. After a quick browsing of the forums, this seems like a good choice.

For our test, we want to be able to take readings at the same time consistently. First we should teleport to something with a good visual indicator. Rocket launches fit the bill. /c game.player.teleport({-12848, -875}) gets us there. We set our game zoom to 1.000 by hitting the F9 key. The bottom rocket silo finishes readying for launch right around the 57092400 game tick. (The /c game.player.print(game.tick) command gets us the game tick). It is in this area we save the game, as from this point we will begin collecting our data.

For show-fps, we will take our first sample right as the bottom most silo's rocket reaches the edge of the screen. We then wait until around game tick 57098000 when the bottom silo finishes readying again. As the rocket again reaches the edge of the screen we record the data. Finally, after waiting a while, the bottom silo is ready again at tick 57114100. We take our final reading for the run as the rocket crosses the screen border. The duration of this exchange is roughly 23,000 ticks. Thus we can run these same 23,000 ticks when we do our testing of the other methods of data collection.

We repeat this procedure 3 times for each method, then record the results and compare each one against the average of those results.

The Data

First up we have the show-fps method:

show-fps	Run 1	Run 2	Run 3
Sample point 1	34.7	34.6	34.7
Sample point 2	34.7	34.2	34.4
Sample point 3	36	36	35.9
Average per run:	35.13	34.93	35.00

Average overall:	35.02

This yields us an average UPS of 35.02.
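The averages in the table can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic; the variable and function names are illustrative.

```python
from statistics import mean

# UPS readings from the show-fps table: three sample points per run.
SHOW_FPS_RUNS = {
    "Run 1": [34.7, 34.7, 36.0],
    "Run 2": [34.6, 34.2, 36.0],
    "Run 3": [34.7, 34.4, 35.9],
}


def per_run_averages(runs):
    """Average UPS of each run."""
    return {name: mean(samples) for name, samples in runs.items()}


def overall_average(runs):
    """Average UPS over every sample of every run."""
    return mean(sample for samples in runs.values() for sample in samples)


if __name__ == "__main__":
    for name, average in per_run_averages(SHOW_FPS_RUNS).items():
        print(f"{name}: {average:.2f}")
    print(f"Overall: {overall_average(SHOW_FPS_RUNS):.2f}")
```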

Next we have the --benchmark method. I wrote a helper script to automatically run and record the values reported by the benchmark. It is available here: https://github.com/mulark/factorio_benchmark_scripts

map_name	run_index	avg_ms	min_ms	max_ms	ticks	execution_time (ms)	Converted UPS
FuzzyPants.zip	1	27.06	24.674	56.338	23000	622371.106	36.955
FuzzyPants.zip	2	27.095	24.685	53.648	23000	623176.686	36.907
FuzzyPants.zip	3	27.073	24.654	53.965	23000	622676.959	36.937

We can convert our avg_ms values to UPS values by taking 1000 / avg_ms. These converted UPS numbers do not include the render overhead, so they cannot be directly compared to the show-fps numbers.
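As a cross-check, the sketch below applies that conversion to the reported avg_ms values. It also confirms that, in this data, avg_ms agrees with execution_time divided by ticks to within rounding. The names are illustrative; the numbers are copied from the table.

```python
# (run_index, avg_ms, ticks, execution_time_ms) copied from the table above.
BENCHMARK_RUNS = [
    (1, 27.06, 23000, 622371.106),
    (2, 27.095, 23000, 623176.686),
    (3, 27.073, 23000, 622676.959),
]


def converted_ups(avg_ms: float) -> float:
    """Updates per second implied by an average update time in milliseconds."""
    return 1000.0 / avg_ms


def ms_per_tick(execution_time_ms: float, ticks: int) -> float:
    """Average update time recomputed from the total run time."""
    return execution_time_ms / ticks


if __name__ == "__main__":
    for run_index, avg_ms, ticks, execution_time_ms in BENCHMARK_RUNS:
        recomputed = ms_per_tick(execution_time_ms, ticks)
        print(f"run {run_index}: {converted_ups(avg_ms):.3f} UPS "
              f"(avg_ms {avg_ms}, recomputed {recomputed:.3f})")
```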

These values indicate that a single show-fps sample can be as much as 2.72% away from the average UPS. The average distance from the average was 1.79%. Meanwhile, when benchmarking over 23,000 ticks, each run was less than 0.1% away from the overall average. Thus it can be reasonably concluded that the --benchmark method gives more repeatable results and is the superior choice.
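The percentages can be recomputed as follows. This sketch assumes each distance is expressed as a percentage of the individual sample. That choice is an assumption, picked because it appears to reproduce the quoted figures; dividing by the mean instead gives slightly different numbers.

```python
from statistics import mean

# All nine show-fps samples and the three benchmark avg_ms values.
SHOW_FPS_SAMPLES = [34.7, 34.6, 34.7, 34.7, 34.2, 34.4, 36.0, 36.0, 35.9]
BENCHMARK_AVG_MS = [27.06, 27.095, 27.073]


def percent_deviations(values):
    """Distance of each value from the mean, as a percentage of that value."""
    center = mean(values)
    return [abs(value - center) / value * 100.0 for value in values]


if __name__ == "__main__":
    fps_deviations = percent_deviations(SHOW_FPS_SAMPLES)
    print(f"show-fps worst sample: {max(fps_deviations):.2f}%")
    print(f"show-fps average distance: {mean(fps_deviations):.2f}%")
    print(f"benchmark worst run: {max(percent_deviations(BENCHMARK_AVG_MS)):.2f}%")
```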

Closing

In closing, all future tests will use the --benchmark method, as it has been shown to give more consistent results than the other methods. If rendering requires testing, it's likely that the show-fps method will be used. There are still other elements which warrant testing, like looking for the optimal number of ticks to use and the optimal number of runs. However, those tests will be conducted at a later time and date.