git-p4: preserve utf8 BOM when importing from p4 to git

Perforce has a file type "utf8" which represents a text file with an explicit BOM. utf8-encoded files *without* a BOM are stored as the regular file type "text". The "utf8" file type behaves like text in all but one important way: it is stored, internally, without the leading 3 BOM bytes.

git-p4 has historically imported utf8-with-BOM files (files stored, in Perforce, as type "utf8") the same way as regular text files, losing the BOM in the process.

Under most circumstances this issue has little functional impact, as most systems consider the BOM to be optional and redundant, but this *is* a correctness failure, and it can lead to practical issues when BOMs are explicitly included in test files, for example in a file encoding test suite.

Fix the handling of utf8-with-BOM files when importing changes from p4 to git, and introduce a test that checks it is working correctly.

Signed-off-by: Tao Klerks <tao@klerks.biz>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Tao Klerks <tao@klerks.biz> 2022-04-04 08:50:36 +0300
committer: Junio C Hamano <gitster@pobox.com> 2022-04-06 22:59:58 +0300
commit: fbe5f6b80437adbcd58af1b3751b830910a2ddaa (patch)
tree: 21ee632bc2823430943d7f5ff4af33591a06637f /git-p4.py
parent: faa21c10d44184f616d391c158dcbb13b9c72ef3 (diff)
1 files changed, 10 insertions, 0 deletions
diff --git a/git-p4.py b/git-p4.py
index a9b1f90441..6d932e7ed7 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -2885,6 +2885,16 @@ class P4Sync(Command, P4UserMap):
             print("\nIgnoring apple filetype file %s" % file['depotFile'])
             return
 
+        if type_base == "utf8":
+            # The type utf8 explicitly means utf8 *with BOM*. These are
+            # streamed just like regular text files, however, without
+            # the BOM in the stream.
+            # Therefore, to accurately import these files into git, we
+            # need to explicitly re-add the BOM before writing.
+            # 'contents' is a set of bytes in this case, so create the
+            # BOM prefix as a b'' literal.
+            contents = [b'\xef\xbb\xbf' + contents[0]] + contents[1:]
+
         # Note that we do not try to de-mangle keywords on utf16 files,
         # even though in theory somebody may want that.
         regexp = p4_keywords_regexp_for_type(type_base, type_mods)
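
The BOM handling in the hunk above can be tried in isolation. The following is a minimal standalone sketch, not part of git-p4: the function name `restore_utf8_bom` and the sample data are hypothetical, and the guard for an empty chunk list is an assumption of this sketch rather than behavior taken from the patch. It assumes, as the patch comment describes, that the file contents arrive as a list of byte chunks with the BOM already stripped.

```python
import codecs


def restore_utf8_bom(chunks):
    """Return a new list of byte chunks with the UTF-8 BOM prepended.

    Hypothetical helper mirroring the one-line fix in the patch; the
    input list is left unmodified.
    """
    if not chunks:
        # Assumption of this sketch only: an empty stream becomes a
        # BOM-only file. The patch itself indexes contents[0] directly.
        return [codecs.BOM_UTF8]
    # codecs.BOM_UTF8 is the same three bytes as b'\xef\xbb\xbf'.
    return [codecs.BOM_UTF8 + chunks[0]] + chunks[1:]


# Sample chunks as they might be streamed for a "utf8"-typed file.
streamed = [b"hello ", b"world\n"]
restored = restore_utf8_bom(streamed)
print(b"".join(restored))
```

Only the first chunk is rewritten, so the remaining chunks are reused as-is and the BOM appears exactly once at the start of the reassembled file.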