Commit 1d25757

Optionally prefetch referenced data in recovery.

Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <[email protected]> (parts)
Reviewed-by: Andres Freund <[email protected]> (parts)
Reviewed-by: Tomas Vondra <[email protected]> (parts)
Tested-by: Tomas Vondra <[email protected]>
Tested-by: Jakub Wartak <[email protected]>
Tested-by: Dmitry Dolgov <[email protected]>
Tested-by: Sait Talha Nisanci <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com

1 parent f003d9f commit 1d25757
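
The commit message says prefetching is currently done with posix_fadvise(). As a rough illustration only (not the code added by this commit), a hint for a single block could look like the sketch below. The names prefetch_block, prefetch_range and BLOCK_SIZE are hypothetical, and the 8192-byte block size is an assumption based on the usual PostgreSQL default.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192         /* assumption: default PostgreSQL block size */

/*
 * Hint to the kernel that block `blkno` of the file open on `fd` will be
 * read soon.  Returns 0 on success or an error number on failure;
 * posix_fadvise() returns the error directly rather than setting errno.
 */
int
prefetch_block(int fd, unsigned int blkno)
{
    off_t offset = (off_t) blkno * BLOCK_SIZE;

    return posix_fadvise(fd, offset, BLOCK_SIZE, POSIX_FADV_WILLNEED);
}

/*
 * Hint for `nblocks` consecutive blocks starting at `first`.
 * Stops at the first failure and returns its error number.
 */
int
prefetch_range(int fd, unsigned int first, int nblocks)
{
    assert(nblocks >= 0);

    for (int i = 0; i < nblocks; i++)
    {
        int rc = prefetch_block(fd, first + (unsigned int) i);

        if (rc != 0)
            return rc;
    }
    return 0;
}
```

The call is only advisory: the kernel may ignore it, and a successful return does not tell the caller when (or whether) the read actually happened, which is one reason the commit message describes its completion tracking as a pessimistic heuristic.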

23 files changed: +1502 -19 lines changed

doc/src/sgml/config.sgml

+83
@@ -3565,6 +3565,89 @@ include_dir 'conf.d'
 </variablelist>
 </sect2>
 
+<sect2 id="runtime-config-wal-recovery">
+
+<title>Recovery</title>
+
+<indexterm>
+<primary>configuration</primary>
+<secondary>of recovery</secondary>
+<tertiary>general settings</tertiary>
+</indexterm>
+
+<para>
+This section describes the settings that apply to recovery in general,
+affecting crash recovery, streaming replication and archive-based
+replication.
+</para>
+
+
+<variablelist>
+<varlistentry id="guc-recovery-prefetch" xreflabel="recovery_prefetch">
+<term><varname>recovery_prefetch</varname> (<type>boolean</type>)
+<indexterm>
+<primary><varname>recovery_prefetch</varname> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whether to try to prefetch blocks that are referenced in the WAL that
+are not yet in the buffer pool, during recovery. Prefetching blocks
+that will soon be needed can reduce I/O wait times in some workloads.
+See also the <xref linkend="guc-wal-decode-buffer-size"/> and
+<xref linkend="guc-maintenance-io-concurrency"/> settings, which limit
+prefetching activity.
+This setting is disabled by default.
+</para>
+<para>
+This feature currently depends on an effective
+<function>posix_fadvise</function> function, which some
+operating systems lack.
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
+<term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
+<indexterm>
+<primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whether to prefetch blocks that were logged with full page images,
+during recovery. Often this doesn't help, since such blocks will not
+be read the first time they are needed and might remain in the buffer
+pool after that. However, on file systems with a block size larger
+than
+<productname>PostgreSQL</productname>'s, prefetching can avoid a
+costly read-before-write when blocks are later written.
+The default is off.
+</para>
+</listitem>
+</varlistentry>
+
+<varlistentry id="guc-wal-decode-buffer-size" xreflabel="wal_decode_buffer_size">
+<term><varname>wal_decode_buffer_size</varname> (<type>integer</type>)
+<indexterm>
+<primary><varname>wal_decode_buffer_size</varname> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+A limit on how far ahead the server can look in the WAL, to find
+blocks to prefetch. Setting it too high might be counterproductive,
+if it means that data falls out of the
+kernel cache before it is needed. If this value is specified without
+units, it is taken as bytes.
+The default is 512kB.
+</para>
+</listitem>
+</varlistentry>
+
+</variablelist>
+</sect2>
+
 <sect2 id="runtime-config-wal-archive-recovery">
 
 <title>Archive Recovery</title>

doc/src/sgml/monitoring.sgml

+84-2
@@ -337,6 +337,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
 </entry>
 </row>
 
+<row>
+<entry><structname>pg_stat_prefetch_recovery</structname><indexterm><primary>pg_stat_prefetch_recovery</primary></indexterm></entry>
+<entry>Only one row, showing statistics about blocks prefetched during recovery.
+See <xref linkend="pg-stat-prefetch-recovery-view"/> for details.
+</entry>
+</row>
+
 <row>
 <entry><structname>pg_stat_subscription</structname><indexterm><primary>pg_stat_subscription</primary></indexterm></entry>
 <entry>At least one row per subscription, showing information about
@@ -2917,6 +2924,78 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 copy of the subscribed tables.
 </para>
 
+<table id="pg-stat-prefetch-recovery-view" xreflabel="pg_stat_prefetch_recovery">
+<title><structname>pg_stat_prefetch_recovery</structname> View</title>
+<tgroup cols="3">
+<thead>
+<row>
+<entry>Column</entry>
+<entry>Type</entry>
+<entry>Description</entry>
+</row>
+</thead>
+
+<tbody>
+<row>
+<entry><structfield>prefetch</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks prefetched because they were not in the buffer pool</entry>
+</row>
+<row>
+<entry><structfield>skip_hit</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they were already in the buffer pool</entry>
+</row>
+<row>
+<entry><structfield>skip_new</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because they were new (usually relation extension)</entry>
+</row>
+<row>
+<entry><structfield>skip_fpw</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because a full page image was included in the WAL and <xref linkend="guc-recovery-prefetch-fpw"/> was set to <literal>off</literal></entry>
+</row>
+<row>
+<entry><structfield>skip_seq</structfield></entry>
+<entry><type>bigint</type></entry>
+<entry>Number of blocks not prefetched because of repeated access</entry>
+</row>
+<row>
+<entry><structfield>distance</structfield></entry>
+<entry><type>integer</type></entry>
+<entry>How far ahead of recovery the prefetcher is currently reading, in bytes</entry>
+</row>
+<row>
+<entry><structfield>queue_depth</structfield></entry>
+<entry><type>integer</type></entry>
+<entry>How many prefetches have been initiated but are not yet known to have completed</entry>
+</row>
+<row>
+<entry><structfield>avg_distance</structfield></entry>
+<entry><type>float4</type></entry>
+<entry>How far ahead of recovery the prefetcher is on average, while recovery is not idle</entry>
+</row>
+<row>
+<entry><structfield>avg_queue_depth</structfield></entry>
+<entry><type>float4</type></entry>
+<entry>Average number of prefetches in flight while recovery is not idle</entry>
+</row>
+</tbody>
+</tgroup>
+</table>
+
+<para>
+The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+one row. It is filled with nulls if recovery is not running or WAL
+prefetching is not enabled. See <xref linkend="guc-recovery-prefetch"/>
+for more information. The counters in this view are reset whenever the
+<xref linkend="guc-recovery-prefetch"/>,
+<xref linkend="guc-recovery-prefetch-fpw"/> or
+<xref linkend="guc-maintenance-io-concurrency"/> setting is changed and
+the server configuration is reloaded.
+</para>
+
 <table id="pg-stat-subscription" xreflabel="pg_stat_subscription">
 <title><structname>pg_stat_subscription</structname> View</title>
 <tgroup cols="1">
@@ -5049,8 +5128,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 all the counters shown in
 the <structname>pg_stat_bgwriter</structname>
 view, <literal>archiver</literal> to reset all the counters shown in
-the <structname>pg_stat_archiver</structname> view or <literal>wal</literal>
-to reset all the counters shown in the <structname>pg_stat_wal</structname> view.
+the <structname>pg_stat_archiver</structname> view,
+<literal>wal</literal> to reset all the counters shown in the
+<structname>pg_stat_wal</structname> view or
+<literal>prefetch_recovery</literal> to reset all the counters shown
+in the <structname>pg_stat_prefetch_recovery</structname> view.
 </para>
 <para>
 This function is restricted to superusers by default, but other users

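The cumulative columns of pg_stat_prefetch_recovery each count one reason a referenced block was or was not prefetched. The sketch below shows one way such a decision could be tallied. It is an illustration, not the logic in xlogprefetch.c: the BlockRef struct, the consider_block function and the order in which the checks run are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Counters named after the cumulative columns of pg_stat_prefetch_recovery. */
typedef struct PrefetchStats
{
    uint64_t prefetch;
    uint64_t skip_hit;
    uint64_t skip_new;
    uint64_t skip_fpw;
    uint64_t skip_seq;
} PrefetchStats;

/* Hypothetical summary of one block reference found in a WAL record. */
typedef struct BlockRef
{
    bool has_full_page_image;   /* record carries a full page image */
    bool same_as_previous;      /* same block as the previous reference */
    bool is_new;                /* block does not exist yet (relation extension) */
    bool in_buffer_pool;        /* block is already cached */
} BlockRef;

/*
 * Decide whether to prefetch the block and bump the matching counter.
 * Returns true if a prefetch should be issued.  The check order here is
 * an assumption chosen for the example.
 */
bool
consider_block(PrefetchStats *stats, const BlockRef *ref,
               bool recovery_prefetch_fpw)
{
    assert(stats != NULL && ref != NULL);

    if (ref->has_full_page_image && !recovery_prefetch_fpw)
    {
        stats->skip_fpw++;
        return false;
    }
    if (ref->same_as_previous)
    {
        stats->skip_seq++;
        return false;
    }
    if (ref->is_new)
    {
        stats->skip_new++;
        return false;
    }
    if (ref->in_buffer_pool)
    {
        stats->skip_hit++;
        return false;
    }
    stats->prefetch++;
    return true;
}
```

Each block reference increments exactly one counter, so in this sketch the five counters always sum to the number of references examined.
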
doc/src/sgml/wal.sgml

+17
@@ -803,6 +803,23 @@
 counted as <literal>wal_write</literal> and <literal>wal_sync</literal>
 in <structname>pg_stat_wal</structname>, respectively.
 </para>
+
+<para>
+The <xref linkend="guc-recovery-prefetch"/> parameter can
+be used to improve I/O performance during recovery by instructing
+<productname>PostgreSQL</productname> to initiate reads
+of disk blocks that will soon be needed but are not currently in
+<productname>PostgreSQL</productname>'s buffer pool.
+The <xref linkend="guc-maintenance-io-concurrency"/> and
+<xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+concurrency and distance, respectively. The
+prefetching mechanism is most likely to be effective on systems
+with <varname>full_page_writes</varname> set to
+<varname>off</varname> (where that is safe), and where the working
+set is larger than RAM. Prefetching in recovery is disabled by default
+and, when enabled, requires an operating system with
+<function>posix_fadvise</function> support.
+</para>
 </sect1>
 
 <sect1 id="wal-internals">

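The paragraph added to wal.sgml says maintenance_io_concurrency limits prefetching concurrency, and the commit message describes the completion tracking as a pessimistic heuristic. One plausible heuristic, assumed here rather than taken from the commit, is to treat a prefetch as finished only once replay has reached the WAL record that needed the block. The PrefetchQueue type, its functions and the fixed MAX_IN_FLIGHT limit below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_IN_FLIGHT 4     /* stands in for maintenance_io_concurrency */

/* Ring buffer of prefetches that are not yet known to have completed. */
typedef struct PrefetchQueue
{
    uint64_t lsn[MAX_IN_FLIGHT];    /* WAL position that triggered each prefetch */
    int      head;                  /* index of the oldest in-flight entry */
    int      count;                 /* current queue depth */
} PrefetchQueue;

/* True if another prefetch may be started without exceeding the limit. */
bool
queue_can_start(const PrefetchQueue *q)
{
    return q->count < MAX_IN_FLIGHT;
}

/* Record a newly initiated prefetch for the WAL record at `lsn`. */
void
queue_start(PrefetchQueue *q, uint64_t lsn)
{
    assert(queue_can_start(q));
    q->lsn[(q->head + q->count) % MAX_IN_FLIGHT] = lsn;
    q->count++;
}

/*
 * Pessimistic completion: retire entries only once replay has reached the
 * record that referenced the block, since the block must have been read
 * by then.
 */
void
queue_complete_up_to(PrefetchQueue *q, uint64_t replayed_lsn)
{
    while (q->count > 0 && q->lsn[q->head] <= replayed_lsn)
    {
        q->head = (q->head + 1) % MAX_IN_FLIGHT;
        q->count--;
    }
}
```

In this sketch the `count` field plays the role of the queue_depth column: it can overstate the number of reads truly in flight, but never understate it, so the limit is never exceeded.
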
src/backend/access/transam/Makefile

+1
@@ -31,6 +31,7 @@ OBJS = \
 xlogarchive.o \
 xlogfuncs.o \
 xloginsert.o \
+xlogprefetch.o \
 xlogreader.o \
 xlogutils.o
 