Wednesday, March 21, 2012

How to load a Unicode file into the database in the same order as the file order

The data file is a simple Unicode file with lines of text. BCP
apparently doesn't guarantee this ordering, and neither does the
import tool. I want to be able to load the data either sequentially or
add line numbering to large Unicode file (1 million lines). I don't
want to deal with another programming language if possible and I
wonder if there's a trick in SQL Server to get this accomplished.
Thanks for any help.
Mark Leary
--== Posted via webservertalk.com - Unlimited-Uncensored-Secure Usenet News==--
http://www.webservertalk.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
--= East/West-Coast Server Farms - Total Privacy via Encryption =--
no-email wrote:
> The data file is a simple Unicode file with lines of text. BCP
> apparently doesn't guarantee this ordering, and neither does the
> import tool. I want to be able to load the data either sequentially or
> add line numbering to large Unicode file (1 million lines). I don't
> want to deal with another programming language if possible and I
> wonder if there's a trick in SQL Server to get this accomplished.
> Thanks for any help.
> Mark Leary
>
Why does the order of the rows inserted into the table matter in your
case? Relational databases don't understand row order. If you need them
sorted in some way after the import, you can create a clustered index on
the table to get the rows ordered in a way that helps your queries
perform better.
In general, I think BCP processes the rows in the file sequentially. But
again, I'm not clear on why this matters.
Could you elaborate on the exact issue you are trying to avoid.
David Gugick
Imceda Software
www.imceda.com
|||"David Gugick" <davidg-nospam@.imceda.com> wrote:

> Why does the order of the rows inserted into the table matter in your
> case? Relational databases don't understand row order. If you need them
> sorted in some way after the import, you can create a clustered index on
> the table to get the rows ordered in a way that helps your queries perform
> better.
> In general, I think BCP processes the rows in the file sequentially. But
> again, I'm not clear on why this matters.
> Could you elaborate on the exact issue you are trying to avoid.
I am trying to load a text file sequentially in order to perform text
manipulations using T-SQL that do depend on the exact order. I would be
happy with simply adding a line number to each line of the Unicode text
file, and then loading the file with line number determining the order, but
I want to avoid programming in another language if possible. Eventually the
loaded text would be converted to proper relational tables. This doesn't
have to do with improving performance. Does this help?
Thanks.
|||no-email wrote:
> "David Gugick" <davidg-nospam@.imceda.com> wrote:
>
> I am trying to load a text file sequentially in order to perform text
> manipulations using T-SQL that do depend on the exact order. I would
> be happy with simply adding a line number to each line of the Unicode
> text file, and then loading the file with line number determining the
> order, but I want to avoid programming in another language if
> possible. Eventually the loaded text would be converted to proper
> relational tables. This doesn't have to do with improving
> performance. Does this help?
> Thanks.
Yes. It sounds like you have rows in a specific order that will need to
be processed once on SQL Server. You want to preserve the order of rows
in the file so the rows can be processed in the same order once on SQL
Server.
In order to do this in any relational database, you need a sort key.
There is never a guarantee that a query you run without an ORDER BY will
return rows in the same order in any consistent way.
My understanding is that BCP feeds the rows in the order they appear in
a file. I can't imagine any reason it would or could do it differently.
In that case, you want to insert the data into a table that contains an
IDENTITY column. You can then use that key for your ORDER BY when
processing the rows from whatever process does that.
David Gugick
Imceda Software
www.imceda.com
|||Do you know what order the source file is sorted in? If so, and if the
sort order column(s) are included then you may not need to know the
line number since it is (theoretically anyway) possible to derive that
information from the other data.
If not, then this article has a useful suggestion:
http://www.google.co.uk/groups?selm=...GP11.phx.gb l
not sure if that will work with unicode data though.
David Portas
SQL Server MVP
|||"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in
message:

> Do you know what order the source file is sorted in? If so, and if
the
> sort order column(s) are included then you may not need to know the
> line number since it is (theoretically anyway) possible to derive
that
> information from the other data.
Unfortunately the data file consists of simple lines of text with no
other way to extract potential column information before loading it
into the database.

> If not, then this article has a useful suggestion:
>
http://www.google.co.uk/groups?selm=...GP11.phx.gb l
> not sure if that will work with unicode data though.
Good suggestion but it does fail with Unicode. Thanks anyway.
--== Posted via webservertalk.com - Unlimited-Uncensored-Secure Usenet News==--
http://www.webservertalk.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
--= East/West-Coast Server Farms - Total Privacy via Encryption =--
|||"David Gugick" <davidg-nospam@.imceda.com> wrote:

> My understanding is that BCP feeds the rows in the order they appear
in
> a file. I can't imagine any reason it would or could do it
differently.
> In that case, you want to insert the data into a table that contains
an
> IDENTITY column. You can then use that key for your ORDER BY when
> processing the rows from whatever process does that.
In general BCP loads the data in the same order as the file but not
always. The ordering sometimes reverses for thousands of rows, or
skips certain rows, but you need to check it carefully to find the
misordering. You can create a table with an identity column and load
the data, but again if the rows are not loaded in the same sequence as
the file this won't matter. You will end up with an ordered table that
is unfortunately not in the same order as the original file.
To be honest I have tried all these suggestions in the past. My
typical solution would be to open the original file in Excel, add a
rownumber column and then save the resulting file as a Unicode file.
This works up until around a maximum of 63,000 rows. You can break a
file into 63,000 row subfiles, but this would be too tedious if you
have row counts approaching a million.
Thanks for the suggestions but I may have to learn some C#.
Unfortunately Visual Basic has problems with Unicode, as I suspect
also C++ and C have similar problems.
--== Posted via webservertalk.com - Unlimited-Uncensored-Secure Usenet News==--
http://www.webservertalk.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
--= East/West-Coast Server Farms - Total Privacy via Encryption =--
|||What about using the CTS Import Wizard to import the data from the flat
file into a table with an identity.
Where is this data coming from? Is there any way to recreate it with a
counter column included in the output?
David Gugick
Imceda Software
www.imceda.com
|||"David Gugick" <davidg-nospam@.imceda.com> wrote:

> What about using the CTS Import Wizard to import the data from the
flat
> file into a table with an identity.
It's the same problem. The table order generally follows the order in
the file but not always.

> Where is this data coming from? Is there any way to recreate it with
a
> counter column included in the output?
It's foreign language dictionary data that cannot be recreated. I
could manipulate the data on the file level but I am trying to avoid
potential problems with Unicode. Once the data gets into SQL Server I
don't have any problems, as long as the table order exactly matches
the file order.
--== Posted via webservertalk.com - Unlimited-Uncensored-Secure Usenet News==--
http://www.webservertalk.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
--= East/West-Coast Server Farms - Total Privacy via Encryption =--
|||no-email wrote:
> It's foreign language dictionary data that cannot be recreated. I
> could manipulate the data on the file level but I am trying to avoid
> potential problems with Unicode. Once the data gets into SQL Server I
> don't have any problems, as long as the table order exactly matches
> the file order.
I'm not sure what problems you would have as long as the tool you are
using to edit the data is unicode aware. I use TextEdit for editing
(www.textpad.com) and it has simple replacement expressions.
Assuming you had each row of data on a single line, you could simply do
the following:
1- Add a leading CARRIAGE RETURN to the file
2- Open the Replace dialog
3- Check the Regular Expression option
4- Type "\n" - WITHOUT QUOTES in the Find What entry- means New Line
character
5- Type "\n\i\t" - WITHOUT QUOTES in the Replace With entry - means New
Line + Auto Number + TAB
6- Click Replace All
7 - Remove the leading carriage return in the file
8 - Click FILE SAVE AS and make sure the UNICODE option is selected
You can replace the TAB character with whatever your file requires or
add DOUBLE QUOTES around the Auto Number, etc.
You can download a free trial of TextPad on the web site.
David Gugick
Imceda Software
www.imceda.com

No comments:

Post a Comment