Oct 11, 2007

ASCII/Multibyte to Unicode conversion

 
ASP.NET, All Topics, C# , .NET >> C# Database >> Unedited Reader Contributions
http://www.codeproject.com/useritems/ASCIIMB2UnicodeTSQL.asp

ASCII/Multibyte to Unicode conversion
By carlop().

An article about converting ASCII/Multibyte to unicode using C# and SQL/CLR Integration
C# (C# 2.0), SQL
Windows (WinXP), .NET (.NET 2.0)
Win32, SQL (SQL 2005), VS
DB, Dev
Posted: 8 Feb 2007
Updated: 16 Feb 2007
Views: 15,212
Note: This is an unedited reader contribution  

7 votes for this article.
Popularity: 2.62. Rating: 3.1 out of 5.

Note: This is an unedited contribution. If this article is inappropriate, needs attention or copies someone else's work without reference then please Report this article.

Sample image

Introduction

A common problem when converting legacy database applications is that our customers have a lot of data and they don't want to loose it. Probably data strings are encoded in ASCII / Multibyte but our new system uses NVARCHAR fields. So the problem is that we have to convert old data into Unicode. This is exactly what we'll do.

I have to highlight that the approach I will present is 'oversized' for most cases: in a common scenario you will deal with a single codepage, in that case SQL Server built-in tools are enough. But there are situations that requires a more advanced approach, such as when you have a legacy database that stores strings with different codepages.

As a complete example to show you capabilities of SQL-CLR integration I've also decided to use a Win32 API to perform the conversion, so that you can also see how use P/Invoke from SQL. This can be useful if you have an old DLL and want to use it... but *beware* that it can be really dangerous... if you don't have full knowledge of what you're doing you can keep down the entire SQL Server process !!!

Just a note: I do not provide a test project since the code I will show here is really simple, you can use copy & paste, it's faster !

Solution

The solution I want to show is simple and powerful at the same time. It is made of two logical parts: First we'll build the conversion routine as a standard C# function. Secondly we'll integrate it into SQL Server as a function, so you can use with T-SQL.

Clearly, depending on you application scenario, this may not be the best approach, so once you have the conversion routine you may choose to follow my approach or you may prefer to use it in an external application that performs a batch conversion or you may think of something else.

Part #1: create the conversion routine.

I've used the following well known Win32 API:

int MultiByteToWideChar(
UINT CodePage,
DWORD dwFlags,
LPCSTR lpMultiByteStr,
int cbMultiByte,
LPWSTR lpWideCharStr,
int cchWideChar
);

Here there's the C# signature that can be used for P/Invoking:

  [DllImport("kernel32.dll")]
private static extern int MultiByteToWideChar(
uint CodePage,
uint dwFlags,
[MarshalAs(UnmanagedType.LPArray)] Byte[] lpMultiByteStr,
int cbMultiByte,
[Out, MarshalAs(UnmanagedType.LPArray)] Byte[] lpWideCharStr,
int cchWideChar);

Now it's quite easy to write a class that performs a conversion:

namespace ConvUtils {

public static class Unicode {

public static SqlString ConvToUnicode(SqlInt32 codepage , SqlString multibyteString) {
byte[] b = (byte[])iConvToMultibyteArray(multibyteString);
return (SqlString)ToUnicode((uint)(int)codepage, b);
}

private static string ToUnicode(uint codepage, Byte[] lpMultiByteStr) {
Byte[] lpWideCharStr = new Byte[2*lpMultiByteStr.Length];
MultiByteToWideChar(codepage, 0, lpMultiByteStr, lpMultiByteStr.Length, lpWideCharStr, 2*lpMultiByteStr.Length);
return System.Text.Encoding.Unicode.GetString(lpWideCharStr);
}


private static SqlBinary iConvToMultibyteArray(SqlString multibyteString) {
byte[] result = multibyteString.GetUnicodeBytes();
return (SqlBinary)result;
}
}
}

The example is quite easy and does not require any other explanation. The SqlTypes has been used because I will integrate it into SqlServer, but if you don't need it you may replace them with strings and Byte arrays.

I've also decided to let the codepage be a parameter since the codepage on your PC or on SQL Server can be different from the one needed for the conversion.

Part #2: SQL Server code

Ok, now we have a C# code that converts ASCII/Multibyte into Unicode. The next step is to integrate it into SQL Server so that any database user can have access to this conversion routine:

First create a DLL that can be hosted by SQL Server: all we have to do is to add the [SQLFunction] attribute to the ConvToUnicode :

    [SQLFunction] 
public static SqlString ConvToUnicode( …

Finally build the DLL and integrate it into SQL with a script that is like the following one:

use TESTDB 
go
exec sp_configure "clr enabled", '1'
go
reconfigure
go
-- for test purpouses only, not recommended in production environments
ALTER DATABASE TESTDB SET TRUSTWORTHY ON
go

begin try
CREATE ASSEMBLY [asmUni] FROM 'c:\project_output_dir\uniconv.dll' with permission_set=UNSAFE
end try
begin catch
alter assembly [asmUni] FROM 'c:\project_output_dir\uniconv.dll' WITH UNCHECKED DATA
end catch
go

if exists(select name from sys.objects where name = 'csConvToUnicode') drop function [dbo].[csConvToUnicode]
go
CREATE FUNCTION [dbo].[csConvToUnicode] (
@codepage int,
@multibytestr nvarchar(max)
) returns nvarchar(max)
AS EXTERNAL name [asmUni].[ConvUtils.Unicode].[ConvToUnicode]
go

That's all !

Now you can use this function like any other, for example in a classic select statement, to create a view or to create a trigger that automatically keeps your data updated.

Just a final example on how we can use this function in a T-SQL statement (950 is the codepage for traditional chinese):

select 
description,
dbo.csConvToUnicode(950, description) as converted
from testtable

description converted
----------------------- -------------------
¨àµ£ºô¸ô¦w¥þ 兒童網路安全
°ê»Ú¸ê°T °T°T°T°T°T 國際資訊 訊訊訊訊訊
°ê»Ú¸ê°T °T°T°T°T°T 國際資訊 訊訊訊訊訊
a a
Áô¨pÅv¬Fµ¦ 隱私權政策
test c test c

Conclusion

I have shown a really simple but powerful way to leverage the CLR integration provided by SqlServer to convert old ASCII/Multibyte data into Unicode. I Hope that this is a good starting point for your personal solution.

History

February 16th, 2007 - Added more comments on introduction
February 8th, 2007 - First Version

About carlop()


10 years of hard coding... and it's going on !

Click here to view carlop()'s online profile.



Add this article to:

Discussions and Feedback

14 comments have been posted for this article. Visit http://www.codeproject.com/useritems/ASCIIMB2UnicodeTSQL.asp to post and view comments on this article.
Updated: 16 Feb 2007 Article content copyright carlop(), 2007
everything else Copyright © CodeProject, 1999-2007.

No comments:

Post a Comment