-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix GH-9535: The behavior of mb_strcut in mbstring has been changed in PHP8.1 #9562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @alexdowad |
Hi, @NathanFreeman. Thanks very much for working on mbstring. I am currently travelling and do not have a machine with me which is set up to build PHP, or I would pull your changes, build, and do some testing. In any case, the issue raised in #9535 is 100% a bug, and your changes do apparently fix it. Our test suite for Best would be to add a test which generates several thousands (or 10,000s) of random strings, of different lengths, and in different encodings, runs them through Set the RNG seed to a fixed value at the beginning of the test (with Does that sound like something which might be feasible? If not, please let me know. If you do it, it will be very interesting to see if any other bugs are found... |
Hi, @alexdowad I find that the above encodings will cause this bug. It seems that the used substitution character may not be supported by the target character encoding and use So after executing the filter_flush function I need to check the device.buffer changes to find out the location of the illegal subcharacters in it and modify the value of the device.pos. position = device.pos;
(*encoder->filter_flush)(encoder);
illegal_substchar = (unsigned char) encoder->illegal_substchar;
while(device.pos > position) {
/* check illegal output */
if (device.buffer[position++] == illegal_substchar) {
device.pos = position - 1;
break;
}
} |
Chinese Japanese and other languages are easier to trigger this bug after executing |
@NathanFreeman Thank you very much for taking the time to add more tests. I strongly suspected that more testing would reveal something interesting. (It usually does!) As mentioned, I am on a trip. Yesterday I took a bit of time to install everything needed on the laptop which I am carrying so that I can build PHP. I am just contemplating whether I should try to squeeze some time to step through this adjusted code in a debugger and understand the issue and fix in more detail. It would be easier once I get back home. In any case, your code does fix a bug and you have added a number of additional tests, which gives a bit more confidence that the fix doesn't break anything. I am thinking that I may approve the fix for now... I would just kindly like to ask that you amend the commit log message for the commit which adds the new tests. Right now we have two commits which are both called "fix #9535". |
I guess this fix will be merged into If @NathanFreeman can update the description for the 2nd commit as requested, I could merge, or else any other core developer could do it... |
f5bb86d
to
bff55bd
Compare
Hi, @alexdowad |
Before I merge, do any others who are interested in mbstring want to say anything? @cmb69 @nikic @kamil-tekiela |
There is no need to hurry. :) Next 8.1/8.2 RCs are only scheduled for tagging on 2022-10-11. Regarding the fix: could this cause issues if there are |
Hi, @cmb69 |
var_dump(mb_strcut('???', 0, 2, 'CP50220')); It's a special case. if ((encoder->status && ((encoder->status & 0xF) || (encoder->status == 0x11))) || encoder->cache) {
illegal_substchar = (unsigned char) encoder->illegal_substchar;
while (device.pos > position) {
/* check illegal output */
if (device.buffer[position++] == illegal_substchar) {
device.pos = position - 1;
break;
}
}
} |
Hi @NathanFreeman, and thank you again for diligently working on this issue. I think this time I may actually build PHP with your patch and step through in a debugger to see how it works. |
Hmm. I have built the code and walked through it to see how the fix works. I'm afraid that we may need to do more to get a robust fix, which will resolve this problem without breaking anything else. @NathanFreeman, before we go further,
|
@alexdowad Hi. Answer 1: The filters will process characters one by one according to certain rules, and modify the filter properties. |
1 is true. You should also note that some filters convert bytes in an "input" text encoding to wchars, while others convert wchars to bytes in an "output" text encoding. Both types are represented by the same C A "wchar" is a Unicode codepoint, stored in a 32-bit or larger value (such as a C To convert text from one encoding to another, we typically join together two filters, one to convert the input text to wchars, and another to convert the wchars to the desired destination encoding. 2: See php-src/ext/mbstring/libmbfl/filters/mbfilter_cp5022x.c Lines 553 to 566 in c417990
For some text encodings, it is necessary to look ahead to the next byte/wchar to know how to convert the current one. This is the case when converting wchars to CP50220, since some combinations of 2 codepoints are represented as a single "character" in CP50220. Since the conversion filters just receive one byte/wchar at a time, the only way they can compare with the following one is to store the current one (usually in However, what should then happen when we reach the end of the input string? After the last call to the filter function, there will still be a codepoint sitting in The answer is provided by the "flush functions". The flush functions are called at the end of a conversion operation, to "flush out" any cached data which is being held in (Since calling the flush functions is the last thing which is done before completing a conversion operation, it is also a good place to signal an error if the input string ended in an illegal state.) Do you see why
One more point... I have explained above what |
Thank you for your reply. |
@alexdowad @cmb69 Hi. |
Hi @NathanFreeman. I guess from your latest message that you are starting to understand more about how mbstring works? I think right now our concern is only about As far as the We need you to understand mbstring a bit more, then I think it will become clear what needs to be done. (Hint: there are better ways of suppressing error output from the Please let me know when you are ready to respond to my questions above. |
@alexdowad Hi. In shorts, The encoder let the decoder to check for the I read the details of |
@NathanFreeman 👍🏻 👍🏻 👍🏻 Sounds like a great idea. I suggest you set |
@alexdowad Hi. |
I'm happy with this. 👍 Thanks very much! @NathanFreeman Could you squash this series of 7 commits into one commit? 😅 And I guess the next question is... are you interested in working on mbstring some more? If so, I can give you another issue to work on. |
2bf0ae7
to
25c4244
Compare
@alexdowad Hi. |
@NathanFreeman I have just built this commit and ran the tests locally, I am getting a failure in the new test as follows:
I analyzed the reason for these failures. The expected test results which you put in The problem is that php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c Line 1119 in b87a2e6
As well as this one: php-src/ext/mbstring/libmbfl/mbfl/mbfilter.c Line 1142 in b87a2e6
Notice that when it is deciding whether or not to continue consuming more bytes from the input string, This is problematic for a couple of reasons. For one thing, if there are invalid byte sequences in the middle of the requested range, those will be converted to error markers, and the error markers may have a different byte length, which would cause More to the point in this case, some of the flush functions emit escape sequences when ending a converted string. For JIS/ISO-2022-JP, this is the relevant bit: php-src/ext/mbstring/libmbfl/filters/mbfilter_jis.c Lines 449 to 454 in b87a2e6
For ISO-2022-JP-2004, it's here: php-src/ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c Lines 731 to 738 in b87a2e6
Hopefully the comment in the 2nd one makes it clear why this is done. (If it doesn't make sense, I will explain more.) Historical note: The added escape sequence at the end of a JIS/ISO-2022-JP string was part of the original code from libmbfl, but the one for ISO-2022-JP-2004 was added by me. While we want these ending escape sequences to appear in the output, we do not want their byte length to be subtracted from the size of the extracted portion. It seems to me that the condition Since @NathanFreeman is now setting the I'm thinking about this and it seems that it should work. Is there anything I'm missing? |
By the way, @NathanFreeman... since this is a separate issue from the one you are trying to fix in this PR... Another option would be that you put comments in the implementation code and test code explaining this issue, and we leave it to be resolved in another PR. It is a pre-existing bug and has nothing to do with your change. You could set the new test to |
Maybe it's better to split the test case into two test cases, and only mark the failing test as XFAIL. |
Hmm, update here. I tried implementing the idea mentioned above. After examining the inputs/outputs of some test cases byte-by-byte, I can say that the output seems more logical than before. As a side benefit, about 150 lines of very tricky (and brittle) code are eliminated. However, the output is significantly different for all encodings which have selectable 'modes', including HZ, all the ISO-2022 encodings, and CP5022{0,1,2}. Output is also significantly different for UTF-7 and UTF-7-IMAP. This is because in UTF-7, all non-ASCII characters are Base64-encoded, and Base64-encoding automatically pads the output to a multiple of 4 bytes. Since the legacy implementation of The output is also changed for GB18030, which seems odd, since GB18030 does not include anything like escape codes, nor does it pad its output to a certain byte length. I haven't dug in to see what the cause of the change for GB18030 is. Therefore, while this simplified implementation might arguably be considered more "correct", it would be a significant BC break. I can just imagine the complaints from users whose existing code would be broken. I am wondering if we should just define the existing behavior (i.e. deducting the length of an ending escape sequence from the number of bytes which were requested from the input string) to be "correct" and stick to it. If we do that, then @NathanFreeman could just adjust the expected values for the failing test cases so that the test passes. @NathanFreeman, another observation on your code: on line 1284, you restore the previous @cmb69 @Girgias @nikic Any comments on whether it is a good idea to define the historical behavior of |
@NathanFreeman If you are available to work on another mbstring issue, it would be helpful if you could resolve this one: The Unicode standard includes special casing (uppercase, lowercase, etc) rules for Greek, Lithuanian, Turkish, and Azeri. (https://unicode.org/Public/UNIDATA/SpecialCasing.txt) We do not implement any of these 'conditional mapping' rules right now. The most important conditional mapping which we would like to implement is the special case for the Greek letter sigma. This should only be done on If you look in |
It might make sense to consider this as a bug fix, but due to the BC only land this change in PHP 8.2 and master? (as we're still in an RC I think it still makes sense to patch 8.2) as producing a string which is not valid within the encoding sounds pretty bad to me... |
@Girgias for clarity, the current implementation of For example, if you indicate that you want to extract 15 bytes from the input string, |
@alexdowad Hi. For ISO-2022-JP-2004. php-src/ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c Lines 731 to 738 in b87a2e6
|
I have modified my test by following your suggestions. |
I have removed the unuseful code. Thanks. |
I don't think it is a good idea to keep the historical behavior for ever, if it sometimes doesn't make sense. Now, if we assume that some users rely on the current behavior, we should consider a deprecation phase (if feasible) telling users that a certain |
@NathanFreeman I would like to apologize that it took a very long time to merge this bug fix. Thanks for working on it. When merging, I squashed commits and edited the commit log message to make it more informative. I also updated NEWS and credited you for the fix there. Further, I also split the passing/failing tests into two different test files, as suggested by @cmb69, and put XFAIL only on the failing ones. Then merged up into |
OK. Not sure when this will happen. But definitely there are cases where the current (legacy) behavior of |
No description provided.