Skip to content

Commit 7299096

Browse files
committed
New implementation of mb_strimwidth
This new implementation of mb_strimwidth uses the new text encoding conversion filters. Changes from the previous implementation: • mb_strimwidth allows a negative 'from' argument, which should count backwards from the end of the string. However, the implementation of this feature was buggy (starting right from when it was first implemented). It used the following code: if ((from < 0) || (width < 0)) { swidth = mbfl_strwidth(&string); } if (from < 0) { from += swidth; } Do you see the bug? 'from' is a count of CODEPOINTS, but 'swidth' is a count of TERMINAL COLUMNS. Adding those two together does not make sense. If there were no fullwidth characters in the input string, then the two counts coincide and the feature would work correctly. However, each fullwidth character would throw the result off by one, causing more characters to be skipped than was requested. • mb_strimwidth also allows a negative 'width' argument, which again counts backwards from the end of the string; in this case, it is not determining the START of the portion which we want to extract, but rather, the END of that portion. Perhaps unsurprisingly, this feature was also buggy. Code: if (width < 0) { width = swidth + width - from; } 'swidth + width' is fine here; the problem is '- from'. Again, that is subtracting a count of CODEPOINTS from a count of TERMINAL COLUMNS. In this case, we really need to count the terminal width of the string prefix skipped over by 'from', and subtract that rather than the number of codepoints which are being skipped. As a result, if a 'from' count was passed along with a negative 'width', for every fullwidth character in the skipped prefix, the result of mb_strimwidth was one terminal column wider than requested. Since these situations were covered by unit tests, you might wonder why the bugs were not caught. Well, as far as I can see, it looks like the author of the 'tests' just captured the actual output of mb_strimwidth and defined it as 'correct'. The tests were written in such a way that it was difficult to examine them and see whether they made sense or not; but a careful examination of the inputs and outputs clearly shows that the legacy tests did not conform to the documented contract of mb_strimwidth. • The old implementation would always pass the input string through decoding/encoding filters before returning it to the caller, even if it fit within the specified width. This means that invalid byte sequences would be converted to error markers. For performance, the new implementation returns the very same string which was passed in if it does not exceed the specified width. This means that erroneous byte sequences are not converted to error markers unless it is necessary to trim the string. • The same applies to the 'trim marker' string. • The old implementation was buggy in the (unusual) case that the trim marker is wider than the requested maximum width of the result. It did an unsigned subtraction of the requested width and the width of the trim marker. If the width of the trim marker was greater, that subtraction would underflow and yield a huge number. As a result, mb_strimwidth would then pass the input string through, even if it was far wider than the requested maximum width. In that case, since the input string is wider than the requested width, and NONE of it will fit together with the trim marker, the new implementation returns just the trim marker. This is the one case where the output can be wider than the requested width: when BOTH the input string and also the trim marker are too wide. • Since it passed the input string and trim marker through decoding/encoding filters, when using "Quoted-Printable" as the encoding, newlines could be inserted into the trim marker to maintain the maximum line length for QP. This is an extremely bizarre use case and I don't think there is any point in worrying about it. QP will be removed from mbstring in time, anyways. PERFORMANCE: • From micro-benchmarking with various input string lengths and text encodings, it appears that the new implementation is 2-3x faster for UTF-8 and UTF-16. For legacy Japanese text encodings like ISO-2022-JP or SJIS, the new implementation is perhaps 25% faster. • Note that correctly implementing negative 'from' and 'width' arguments imposes a small performance burden in such cases; one which the old implementation did not pay. This slightly skews benchmarking results in favor of the old implementation. However, even so, the new implementation is faster in all cases which I tested.
1 parent 94fde15 commit 7299096

File tree

4 files changed

+450
-282
lines changed

4 files changed

+450
-282
lines changed

ext/mbstring/libmbfl/mbfl/mbfilter.c

Lines changed: 0 additions & 200 deletions
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,6 @@
9393
#include "filters/mbfilter_singlebyte.h"
9494
#include "filters/mbfilter_utf8.h"
9595

96-
#include "eaw_table.h"
9796
#include "rare_cp_bitvec.h"
9897

9998
/*
@@ -1191,205 +1190,6 @@ mbfl_strcut(
11911190
return result;
11921191
}
11931192

1194-
/* Some East Asian characters, when printed at a terminal (or the like), require double
1195-
* the usual amount of horizontal space. We call these "fullwidth" characters. */
1196-
static size_t character_width(unsigned int c)
1197-
{
1198-
if (c < FIRST_DOUBLEWIDTH_CODEPOINT) {
1199-
return 1;
1200-
}
1201-
1202-
/* Do a binary search to see if we fall in any of the fullwidth ranges */
1203-
int lo = 0, hi = sizeof(mbfl_eaw_table) / sizeof(mbfl_eaw_table[0]);
1204-
while (lo < hi) {
1205-
int probe = (lo + hi) / 2;
1206-
if (c < mbfl_eaw_table[probe].begin) {
1207-
hi = probe;
1208-
} else if (c > mbfl_eaw_table[probe].end) {
1209-
lo = probe + 1;
1210-
} else {
1211-
return 2;
1212-
}
1213-
}
1214-
1215-
return 1;
1216-
}
1217-
1218-
size_t mbfl_strwidth(mbfl_string *string)
1219-
{
1220-
if (!string->len) {
1221-
return 0;
1222-
}
1223-
1224-
size_t width = 0;
1225-
uint32_t wchar_buf[128];
1226-
unsigned char *in = string->val;
1227-
size_t in_len = string->len;
1228-
unsigned int state = 0;
1229-
1230-
while (in_len) {
1231-
size_t out_len = string->encoding->to_wchar(&in, &in_len, wchar_buf, 128, &state);
1232-
while (out_len) {
1233-
/* NOTE: 'bad input' marker will be counted as 1 unit of width
1234-
* If text conversion is performed with an ordinary ASCII character as
1235-
* the 'replacement character', this will give us the correct display width. */
1236-
width += character_width(wchar_buf[--out_len]);
1237-
}
1238-
}
1239-
1240-
return width;
1241-
}
1242-
1243-
1244-
/*
1245-
* strimwidth
1246-
*/
1247-
struct collector_strimwidth_data {
1248-
mbfl_convert_filter *decoder;
1249-
mbfl_convert_filter *decoder_backup;
1250-
mbfl_memory_device device;
1251-
size_t from;
1252-
size_t width;
1253-
size_t outwidth;
1254-
size_t outchar;
1255-
size_t endpos;
1256-
int status;
1257-
};
1258-
1259-
static int
1260-
collector_strimwidth(int c, void* data)
1261-
{
1262-
struct collector_strimwidth_data *pc = (struct collector_strimwidth_data*)data;
1263-
1264-
switch (pc->status) {
1265-
case 10:
1266-
(*pc->decoder->filter_function)(c, pc->decoder);
1267-
break;
1268-
default:
1269-
if (pc->outchar >= pc->from) {
1270-
pc->outwidth += character_width(c);
1271-
1272-
if (pc->outwidth > pc->width) {
1273-
if (pc->status == 0) {
1274-
pc->endpos = pc->device.pos;
1275-
mbfl_convert_filter_copy(pc->decoder, pc->decoder_backup);
1276-
}
1277-
pc->status++;
1278-
(*pc->decoder->filter_function)(c, pc->decoder);
1279-
pc->outchar++;
1280-
return -1;
1281-
} else {
1282-
(*pc->decoder->filter_function)(c, pc->decoder);
1283-
}
1284-
}
1285-
pc->outchar++;
1286-
break;
1287-
}
1288-
1289-
return 0;
1290-
}
1291-
1292-
mbfl_string *
1293-
mbfl_strimwidth(
1294-
mbfl_string *string,
1295-
mbfl_string *marker,
1296-
mbfl_string *result,
1297-
size_t from,
1298-
size_t width)
1299-
{
1300-
struct collector_strimwidth_data pc;
1301-
mbfl_convert_filter *encoder;
1302-
size_t n, mkwidth;
1303-
unsigned char *p;
1304-
1305-
mbfl_string_init(result);
1306-
result->encoding = string->encoding;
1307-
mbfl_memory_device_init(&pc.device, MIN(string->len, width), 0);
1308-
1309-
/* output code filter */
1310-
pc.decoder = mbfl_convert_filter_new(
1311-
&mbfl_encoding_wchar,
1312-
string->encoding,
1313-
mbfl_memory_device_output, 0, &pc.device);
1314-
pc.decoder_backup = mbfl_convert_filter_new(
1315-
&mbfl_encoding_wchar,
1316-
string->encoding,
1317-
mbfl_memory_device_output, 0, &pc.device);
1318-
/* wchar filter */
1319-
encoder = mbfl_convert_filter_new(
1320-
string->encoding,
1321-
&mbfl_encoding_wchar,
1322-
collector_strimwidth, 0, &pc);
1323-
if (pc.decoder == NULL || pc.decoder_backup == NULL || encoder == NULL) {
1324-
mbfl_convert_filter_delete(encoder);
1325-
mbfl_convert_filter_delete(pc.decoder);
1326-
mbfl_convert_filter_delete(pc.decoder_backup);
1327-
return NULL;
1328-
}
1329-
mkwidth = 0;
1330-
if (marker) {
1331-
mkwidth = mbfl_strwidth(marker);
1332-
}
1333-
pc.from = from;
1334-
pc.width = width - mkwidth;
1335-
pc.outwidth = 0;
1336-
pc.outchar = 0;
1337-
pc.status = 0;
1338-
pc.endpos = 0;
1339-
1340-
/* feed data */
1341-
p = string->val;
1342-
n = string->len;
1343-
if (p != NULL) {
1344-
while (n > 0) {
1345-
n--;
1346-
if ((*encoder->filter_function)(*p++, encoder) < 0) {
1347-
break;
1348-
}
1349-
}
1350-
mbfl_convert_filter_flush(encoder);
1351-
if (pc.status != 0 && mkwidth > 0) {
1352-
pc.width += mkwidth;
1353-
if (n > 0) {
1354-
while (n > 0) {
1355-
if ((*encoder->filter_function)(*p++, encoder) < 0) {
1356-
break;
1357-
}
1358-
n--;
1359-
}
1360-
mbfl_convert_filter_flush(encoder);
1361-
} else if (pc.outwidth > pc.width) {
1362-
pc.status++;
1363-
}
1364-
if (pc.status != 1) {
1365-
pc.status = 10;
1366-
pc.device.pos = pc.endpos;
1367-
mbfl_convert_filter_copy(pc.decoder_backup, pc.decoder);
1368-
mbfl_convert_filter_reset(encoder, marker->encoding, &mbfl_encoding_wchar);
1369-
p = marker->val;
1370-
n = marker->len;
1371-
while (n > 0) {
1372-
if ((*encoder->filter_function)(*p++, encoder) < 0) {
1373-
break;
1374-
}
1375-
n--;
1376-
}
1377-
mbfl_convert_filter_flush(encoder);
1378-
}
1379-
} else if (pc.status != 0) {
1380-
pc.device.pos = pc.endpos;
1381-
mbfl_convert_filter_copy(pc.decoder_backup, pc.decoder);
1382-
}
1383-
mbfl_convert_filter_flush(pc.decoder);
1384-
}
1385-
result = mbfl_memory_device_result(&pc.device, result);
1386-
mbfl_convert_filter_delete(encoder);
1387-
mbfl_convert_filter_delete(pc.decoder);
1388-
mbfl_convert_filter_delete(pc.decoder_backup);
1389-
1390-
return result;
1391-
}
1392-
13931193

13941194
/*
13951195
* MIME header encode

ext/mbstring/libmbfl/mbfl/mbfilter.h

Lines changed: 0 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -231,18 +231,6 @@ mbfl_substr(mbfl_string *string, mbfl_string *result, size_t from, size_t length
231231
MBFLAPI extern mbfl_string *
232232
mbfl_strcut(mbfl_string *string, mbfl_string *result, size_t from, size_t length);
233233

234-
/*
235-
* strwidth
236-
*/
237-
MBFLAPI extern size_t
238-
mbfl_strwidth(mbfl_string *string);
239-
240-
/*
241-
* strimwidth
242-
*/
243-
MBFLAPI extern mbfl_string *
244-
mbfl_strimwidth(mbfl_string *string, mbfl_string *marker, mbfl_string *result, size_t from, size_t width);
245-
246234
/*
247235
* MIME header encode
248236
*/

0 commit comments

Comments
 (0)