Skip to content

Commit 98fe150

Browse files
author
xiongcc
committed
添加第48章:How to generate fake data
1 parent 37b9559 commit 98fe150

File tree

1 file changed

+307
-0
lines changed

1 file changed

+307
-0
lines changed

How to generate fake data.md

Lines changed: 307 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,307 @@
1+
# How to generate fake data
2+
3+
>我每天都会发布一篇新的 PostgreSQL "howto" 文章。加入我的旅程吧 — 订阅、提供反馈、分享!
4+
5+
## 简单数字
6+
7+
从 1 到 5 的连续数字:
8+
9+
```sql
10+
nik=# select i from generate_series(1, 5) as i;
11+
i
12+
---
13+
1
14+
2
15+
3
16+
4
17+
5
18+
(5 rows)
19+
```
20+
21+
从 0 到 100 的 5 个随机 `BIGINT` 数字:
22+
23+
```sql
24+
nik=# select (random() * 100)::int8 from generate_series(1, 5);
25+
int8
26+
------
27+
85
28+
61
29+
44
30+
70
31+
16
32+
(5 rows)
33+
```
34+
35+
注意,根据[文档](https://postgresql.org/docs/current/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE)`random()`
36+
37+
> 使用的是确定性的伪随机数生成器。它的速度很快,但不适合用于加密应用。
38+
>
39+
> uses a deterministic pseudo-random number generator. It is fast but not suitable for cryptographic applications...
40+
41+
因此,不应将其用于生成令牌或密码 (这种情况下应使用名为 `pgcrypto` 的库)。但它可以用于生成纯随机数据 (不用于混淆)。
42+
43+
从 Postgres 16 开始,还有 [random_normal(mean, stddev)](https://postgresql.org/docs/16/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE)
44+
45+
>Returns a random value from the normal distribution with the given parameters; `mean` defaults to 0.0 and `stddev` defaults to 1.0
46+
>
47+
>返回具有给定参数的正态分布的随机值;`mean` 默认为 0.0,`stddev` 默认为 1.0
48+
49+
生成 100 万个数字并检查它们的分布:
50+
51+
```sql
52+
nik=# with data (r) as (
53+
select random_normal()
54+
from generate_series(1, 1000000)
55+
)
56+
select
57+
width_bucket(r, -3, 3, 5) as bucket,
58+
count(*) as frequency
59+
from data
60+
group by bucket
61+
order by bucket;
62+
63+
bucket | frequency
64+
--------+-----------
65+
0 | 1374
66+
1 | 34370
67+
2 | 238786
68+
3 | 450859
69+
4 | 238601
70+
5 | 34627
71+
6 | 1383
72+
(7 rows)
73+
```
74+
75+
再次提醒,`random()``random_normal()` 都不应用作加密性强的随机数生成器 — 为此应使用 `pgcrypto`。否则,知道一个值后,可以"猜测"下一个值:
76+
77+
```sql
78+
nik=# set seed to 0.1234;
79+
SET
80+
81+
nik=# select random_normal();
82+
random_normal
83+
----------------------
84+
-0.32020779450641174
85+
(1 row)
86+
87+
nik=# select random_normal();
88+
random_normal
89+
--------------------
90+
0.8995226247227294
91+
(1 row)
92+
93+
nik=# set seed to 0.1234; -- start again
94+
SET
95+
96+
nik=# select random_normal();
97+
random_normal
98+
----------------------
99+
-0.32020779450641174
100+
(1 row)
101+
102+
nik=# select random_normal();
103+
random_normal
104+
--------------------
105+
0.8995226247227294
106+
(1 row)
107+
```
108+
109+
## 时间戳、日期、间隔
110+
111+
2024 年 1 月的时间戳 (带时区),从 2024-01-01 开始,每隔 7 天:
112+
113+
```sql
114+
nik=# show timezone;
115+
TimeZone
116+
---------------------
117+
America/Los_Angeles
118+
(1 row)
119+
120+
nik=# select i from generate_series(timestamptz '2024-01-01', timestamptz '2024-01-31', interval '7 day') i;
121+
i
122+
------------------------
123+
2024-01-01 00:00:00-08
124+
2024-01-08 00:00:00-08
125+
2024-01-15 00:00:00-08
126+
2024-01-22 00:00:00-08
127+
2024-01-29 00:00:00-08
128+
(5 rows)
129+
```
130+
131+
为上一周生成 3 个随机时间戳 (常用于填充如 `created_at` 等字段):
132+
133+
```sql
134+
nik=# select
135+
date_trunc('week', now())
136+
- interval '7 day'
137+
+ format('%s day', (random() * 7))::interval
138+
from generate_series(1, 3);
139+
140+
?column?
141+
-------------------------------
142+
2023-10-31 00:50:59.503352-07
143+
2023-11-03 11:25:39.770384-07
144+
2023-11-03 13:43:27.087973-07
145+
(3 rows)
146+
```
147+
148+
生成年龄在 18-100 岁的人的随机出生日期:
149+
150+
```sql
151+
nik=# select
152+
(
153+
now()
154+
- format('%s day', 365 * 18)::interval
155+
- format('%s day', 365 * random() * 82)::interval
156+
)::date;
157+
date
158+
------------
159+
1954-01-17
160+
(1 row)
161+
```
162+
163+
## 伪单词
164+
165+
生成由 2 到 12 个小写拉丁字母组成的伪单词:
166+
167+
```sql
168+
nik=# select string_agg(chr((random() * 25)::int + 97), '')
169+
from generate_series(1, 2 + (10 * random())::int);
170+
string_agg
171+
------------
172+
yegwrsl
173+
(1 row)
174+
175+
nik=# select string_agg(chr((random() * 25)::int + 97), '')
176+
from generate_series(1, 2 + (10 * random())::int);
177+
string_agg
178+
------------
179+
wusapjx
180+
(1 row)
181+
```
182+
183+
生成包含 5 到 10 个伪单词的"句子":
184+
185+
```sql
186+
nik=# select string_agg(w, ' ')
187+
from
188+
generate_series(1, 5) as i,
189+
lateral (
190+
select string_agg(chr((random() * 25)::int + 97), ''), i
191+
from generate_series(1, 2 + (10 * random())::int + i - i)
192+
) as words(w);
193+
string_agg
194+
-------------------------------------
195+
uvo bwp kcypvcnctui tn demkfnxruwxk
196+
(1 row)
197+
```
198+
199+
注意 `LATERAL` 和引用外部生成器的"空闲"方式 (`i``+i - i`),以确保每次迭代都生成新随机值。
200+
201+
## 正常单词、名字、电子邮件、社会安全号码等 (Faker)
202+
203+
[Faker](https://faker.readthedocs.io/en/master/) 是一个 Python 库,用于生成虚假数据,如姓名、地址、电话号码等。其他语言的 Faker 版本有:
204+
205+
- [Faker for Ruby](https://github.com/faker-ruby/faker)
206+
- [Faker for Go](https://github.com/go-faker/faker)
207+
- [Faker for Java](https://github.com/DiUS/java-faker)
208+
- [Faker for Rust](https://github.com/cksac/fake-rs)
209+
- [Faker for JavaScript](https://github.com/faker-js/faker)
210+
211+
你可以通过以下几种方式使用 Python 的 Faker:
212+
213+
1. 使用 Python 程序和 Postgres 连接 (常规方法,但适用于任何 Postgres 实例,包括 RDS)。
214+
2. [postgresql_faker](https://gitlab.com/dalibo/postgresql_faker/)
215+
3. PL/Python 函数。
216+
217+
此处我们展示如何使用第三种方式,使用 `plpython3u` 的"不受信任"版本 ([Day 47: How to install Postgres 16 with plpython3u](https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos/-/blob/main/0047_how_to_install_postgres_16_with_plpython3u.md); 不适用于托管的 Postgres 服务,如 RDS;注意这种情况下"受信任"版本应该也适合)。
218+
219+
```sql
220+
nik=# create or replace function generate_random_sentence(
221+
min_length int,
222+
max_length int
223+
) returns text
224+
as $$
225+
from faker import Faker
226+
import random
227+
228+
if min_length > max_length:
229+
raise ValueError('min_length > max_length')
230+
231+
fake = Faker()
232+
233+
sentence_length = random.randint(min_length, max_length)
234+
235+
return ' '.join(fake.words(nb=sentence_length))
236+
$$ language plpython3u;
237+
CREATE FUNCTION
238+
239+
nik=# select generate_random_sentence(7, 15);
240+
generate_random_sentence
241+
---------------------------------------------------------------------------------
242+
operation day down forward foreign left your anything clear age seven memory as
243+
(1 row)
244+
```
245+
246+
生成名字、电子邮件和社会安全号码的函数:
247+
248+
```sql
249+
nik=# create or replace function generate_faker_data(
250+
data_type text,
251+
locale text default 'en_US'
252+
)
253+
returns text as $$
254+
from faker import Faker
255+
256+
fake = Faker(locale)
257+
258+
if data_type == 'email':
259+
result = fake.email()
260+
elif data_type == 'lastname':
261+
result = fake.last_name()
262+
elif data_type == 'firstname':
263+
result = fake.first_name()
264+
elif data_type == 'ssn':
265+
result = fake.ssn()
266+
else:
267+
raise Exception('Invalid type')
268+
269+
return result
270+
$$ language plpython3u;
271+
CREATE FUNCTION
272+
273+
nik=# select
274+
generate_faker_data('firstname', locale) as firstname,
275+
generate_faker_data('lastname', locale) as lastname,
276+
generate_faker_data('ssn') as "SSN";
277+
CREATE FUNCTION
278+
279+
nik=# select
280+
locale,
281+
generate_faker_data('firstname', locale) as firstname,
282+
generate_faker_data('lastname', locale) as lastname
283+
from
284+
(values ('en_US'), ('uk_UA'), ('it_IT')) as _(locale);
285+
locale | firstname | lastname
286+
--------+-----------+-----------
287+
en_US | Ashley | Rodgers
288+
uk_UA | Анастасія | Матвієнко
289+
it_IT | Carolina | Donatoni
290+
(3 rows)
291+
292+
nik=# select generate_faker_data('ssn');
293+
generate_faker_data
294+
---------------------
295+
008-47-2950
296+
(1 row)
297+
298+
nik=# select generate_faker_data('email', 'es') from generate_series(1, 5);
299+
generate_faker_data
300+
-------------------------
301+
isidoro42@example.net
302+
anselma04@example.com
303+
torreatilio@example.com
304+
natanael39@example.org
305+
teodosio79@example.net
306+
(5 rows)
307+
```

0 commit comments

Comments
 (0)