author | Tao Klerks <tao@klerks.biz> | 2022-04-04 08:50:36 +0300 |
---|---|---|
committer | Junio C Hamano <gitster@pobox.com> | 2022-04-06 22:59:58 +0300 |
commit | fbe5f6b80437adbcd58af1b3751b830910a2ddaa (patch) | |
tree | 21ee632bc2823430943d7f5ff4af33591a06637f /git-p4.py | |
parent | faa21c10d44184f616d391c158dcbb13b9c72ef3 (diff) |
git-p4: preserve utf8 BOM when importing from p4 to git
Perforce has a file type "utf8" which represents a text file with
explicit BOM. utf8-encoded files *without* BOM are stored as
regular file type "text". The "utf8" file type behaves like text
in all but one important way: it is stored, internally, without
the leading 3 BOM bytes.
git-p4 has historically imported utf8-with-BOM files (files stored,
in Perforce, as type "utf8") the same way as regular text files -
losing the BOM in the process.
Under most circumstances this issue has little functional impact,
as most systems consider the BOM to be optional and redundant, but
this *is* a correctness failure, and can lead to practical issues,
for example when BOMs are explicitly included in test files such as
those in a file encoding test suite.
Fix the handling of utf8-with-BOM files when importing changes from
p4 to git, and introduce a test that checks it is working correctly.
Signed-off-by: Tao Klerks <tao@klerks.biz>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
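To make the failure mode described in the message concrete, here is a minimal Python sketch, not taken from git-p4. It assumes a made-up file body and uses only the standard `codecs` module. It shows why the loss is easy to miss: a BOM-tolerant decoder sees the same text either way, while a byte-level comparison (which is what git stores) does not.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Illustrative content only: a file as a user would submit it with type "utf8".
original = bom + "h\u00e9llo\n".encode("utf-8")
# The same file with the three leading BOM bytes removed.
stripped = original[len(bom):]

# A BOM-tolerant decoder ("utf-8-sig") yields identical text for both...
same_text = original.decode("utf-8-sig") == stripped.decode("utf-8-sig")
# ...but the byte sequences differ, so the imported blob is not the original file.
same_bytes = original == stripped

# A plain "utf-8" decode keeps the BOM as a leading U+FEFF character.
leading_char = original.decode("utf-8")[0]
```

Here `same_text` is true and `same_bytes` is false, which matches the message's point that the problem rarely matters in practice yet is still a correctness failure.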
Diffstat (limited to 'git-p4.py')
-rwxr-xr-x | git-p4.py | 10 |
1 files changed, 10 insertions, 0 deletions
@@ -2885,6 +2885,16 @@ class P4Sync(Command, P4UserMap):
             print("\nIgnoring apple filetype file %s" % file['depotFile'])
             return
 
+        if type_base == "utf8":
+            # The type utf8 explicitly means utf8 *with BOM*. These are
+            # streamed just like regular text files, however, without
+            # the BOM in the stream.
+            # Therefore, to accurately import these files into git, we
+            # need to explicitly re-add the BOM before writing.
+            # 'contents' is a set of bytes in this case, so create the
+            # BOM prefix as a b'' literal.
+            contents = [b'\xef\xbb\xbf' + contents[0]] + contents[1:]
+
         # Note that we do not try to de-mangle keywords on utf16 files,
         # even though in theory somebody may want that.
         regexp = p4_keywords_regexp_for_type(type_base, type_mods)
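The added line can be tried in isolation with a small standalone sketch. The helper name `restore_utf8_bom` and the sample chunks are hypothetical and not part of git-p4. The sketch assumes, as the patch comment states, that the file contents arrive as a list of byte strings with the BOM absent. The empty-list branch is a choice made for this sketch only, not behaviour of the patch.

```python
import codecs
from typing import List


def restore_utf8_bom(chunks: List[bytes]) -> List[bytes]:
    """Return a new chunk list with the UTF-8 BOM prepended to the first chunk.

    Hypothetical helper for illustration; it mirrors the expression
    [b'\\xef\\xbb\\xbf' + contents[0]] + contents[1:] from the patch.
    """
    if not chunks:
        # Assumption of this sketch: an empty stream becomes a BOM-only file.
        return [codecs.BOM_UTF8]
    # Build a new list so the caller's list is left unmodified.
    return [codecs.BOM_UTF8 + chunks[0]] + chunks[1:]


# Made-up streamed content, split into two chunks.
streamed = [b"first chunk, ", b"second chunk\n"]
restored = restore_utf8_bom(streamed)
blob = b"".join(restored)
```

Only the first chunk changes, so the cost is one small concatenation regardless of file size, and the chunk count stays the same for whatever code writes the chunks out afterwards.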