Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error writing data to the connection (2.5.1) / transactions reliability / database corruption [CORE3550] #3906

Open
firebird-automations opened this issue Jul 10, 2011 · 28 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: alainsegura (linlin13)

the connection to the database is lost with error "error writing data to the connection" while running this piece of code (B):
(Error message returned may vary from time to time)

If you re run the code (B), then, after the 2nd error, when you try to reconnect the database is corrupted (wrong page type)

A: Run the code
recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows

B: Run the code
execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

@firebird-automations
Copy link
Collaborator Author

Modified by: alainsegura (linlin13)

description: the connection to the database is lost with error "error writing data to the connection" while running this piece of code:

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

=>

the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned by the engine vary from time to time)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

summary: Error writing data to the connection (2.5.1) => Error writing data to the connection (2.5.1) / transactions reliability

@firebird-automations
Copy link
Collaborator Author

Modified by: alainsegura (linlin13)

description: the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned by the engine vary from time to time)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

=>

the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned may vary from time to time)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

@firebird-automations
Copy link
Collaborator Author

Modified by: alainsegura (linlin13)

priority: Major [ 3 ] => Critical [ 2 ]

description: the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned may vary from time to time)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

=>

the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned may vary from time to time)

If you re run the code, then, after the 2nd error, when you try to reconnect the database is corrupted (wrong page type)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

summary: Error writing data to the connection (2.5.1) / transactions reliability => Error writing data to the connection (2.5.1) / transactions reliability / database corruption

@firebird-automations
Copy link
Collaborator Author

Modified by: alainsegura (linlin13)

description: the connection to the database is lost with error "error writing data to the connection" while running this piece of code:
(Error message returned may vary from time to time)

If you re run the code, then, after the 2nd error, when you try to reconnect the database is corrupted (wrong page type)

execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

=>

the connection to the database is lost with error "error writing data to the connection" while running this piece of code (B):
(Error message returned may vary from time to time)

If you re run the code (B), then, after the 2nd error, when you try to reconnect the database is corrupted (wrong page type)

A: Run the code
recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows

B: Run the code
execute block as
declare variable i integer;
declare variable j integer;

begin
i=0;
while (i<500) do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';

j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ', ' || cast( j as varchar(20)) || ')';
j=j+1;
end

i=i+1;
end
end

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

You mixed DDL and DML in the same transaction, it never works.
I'll look why it crashed the engine but you must change your real application to not work as in this test case.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

I was slightly wrong.

Really, your INSERT's are not in the same transaction as other DDL statements.
But DDL statements are deferred until commit time.
And INSERT's are run against *first* instance of TEST.
This case not works as you probably expect and have no sence for real usage.

But it should not crash the engine and this is the real bug.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

Sure,
in the real code there is a "commit retaining" command between DDL and DML (the sofware is built with DELPHI/ MDO components).

This piece of code reproduce (+/-) the problem but isn't the exact reflect of the real code.

As you say, the engine must not DESTROY THE BATABASE whatever commands you pass to it.

The simple workaround at this time for me is to use regular tables instead of "global temporary tables".

Other thing:
if DDL are executed in separates transactions a have the error "not enough memory" (but no crash of the engine, and no crash of the database) on my computer with plenty of RAM and disk space for virtual memory (8Go of RAM - Windows 7 64bits).

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

I can't reproduce it.
Yes, "execute block" is failed with "unable to allocate memory from operation system" error.
After error i can commit it and query the table which have a lot of records.
I can repeat it, reconnect, etc. No crash, no database corruption.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

Ok.
Certainly because we don't use the same database and because the problem depend on initial database state (i suppose).

From the following url you can download the database i used for testing.

http://pointex.perso.sfr.fr/

In some hours (3hours) you will also find a Virtual PC (Microsoft Virtual PC 2007) VMC/VHD with complete environnment (Zipped 900 Mo). From this virtual machine I'm sure you will reproduce the same errors than me.

Bye.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

It seem that there is data corruption or no depending on windows memory settings.

If the computer has not enough memory, the command stop with error before any data corruption.

If th system is set to let windows handle virtual memory -> error with data corruption.

In the virtual PC, i have no problem to get the database corrupted (!!).

It will be ready soon for upload.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

> If the computer has not enough memory, the command stop with error before any data corruption.
> If th system is set to let windows handle virtual memory -> error with data corruption.

My machine have 8GB of RAM is it enough ? But i see no corruption.

PS I don't use VirtualPC and not going to use it, sorry

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

I reproduced a crash and it is *independent* on temporary\persistent kind of table.
It is enough to
- create table
- disconnect
- connect
- run execute block

But i still see no corruption of database.

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

The corruption must be reproducible without the use of a virtual machine.

The "corruption" you are reporting is more likely due to issues with the virtual system, not with Firebird itself.

Separately, What OS are you running on? 32 or 64bit? How much RAM does the computer have? What setting ("If the system is set to let windows handle virtual memory") are you referring to? (Is this an Virtual PC setting or a native windows setting?)

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

I don't agree with you.

To reproduce a bug, the testing configuration must match the configuration experiencing the bug. It's not easy depending on what is involved in the bug (the database file, the OS, the memory etc...)

For me the engine the MUST KEEP THE FILE IN A SAFE STATE even in the case of an OS call fail.

To repoduce the database crash (wrong page type xxx found x expected y)

* Download the virtual machine here: http://pointex.perso.sfr.fr (ready to download in 3hour from now)
(Virtual PC 2007 is free to download on http://microsoft.com)
* Run flamerobin
* Connect to the "test" database
* Execute the SQL command "execute procedure testproc"
* when the "not enough memory" appear commit (or rollback) the transaction
* Disconnect then reconnect -> the database is corrupted

(in the database folder there is a backup of the initial database file).

The size/content of the database seem to be important to reproduce the problem,
I have reduced the windows memory settings, because the more memory the system has, longer you have to wait.

here is the procedure (to test on the given database or on another one).

CREATE PROCEDURE TESTPROC
AS
declare variable i integer;
declare variable j integer;
begin
i=0;
while (i<500) do begin

in autonomous transaction do begin
execute statement 'recreate global temporary table test (x varchar(20) not null, y varchar(20) not null) on commit
preserve rows';
execute statement 'create unique index xtest on test (x,y)';
execute statement 'grant all on test to FMUSER';
end
j=0;
while (j<10000) do begin
in autonomous transaction do execute statement 'insert into test (x,y) values ( ' || cast(i as varchar(20)) || ',
' || cast( j as varchar(20)) || ')';
j=j+1;
end
i=i+1;
end
end

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Could you put corrupted database for download at your site ?

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

You are trying to hold the database engine responsible for error which may be due to the OS/Virtual environment. That is an impossible situation.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

The database corruption is experienced in real world situations, on different computers and OSs (but allways Windows). Not on Virtualized environnemnts. It's real and reproductible.
Sure not all computers/database experienced the problem. Hard to analyse differences between them.(Allways database 3 or 4 years old and >100Mo).

I don't say "here is the responsible", it's good, it's bad, i just try to help FB team to build a reliable engine.

The difficulty for me, is to build a test case, without the complexity of the real world application. The VM is the solution i have found to make "easy" to reproduce. Not the best, but not the worst.

The corrupted database will be available for download in 1hour (from now).

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

I have just run TESTPROC and it ended with "Transaction is not active"*.

I ran database validation and it found no errors.

* Edit - corrected the error message

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Vlad, you should be able to import the VM into the VM host of your choice; VMWare Player, VMWare Workstation or Oracle VirtualBox.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

Corrupted database is ready for download.
The VM "Virtual PC" is ready for download (VHD/VMC)

I think the simple way is to install Microsoft Virtual PC 2007 (you can download it and use it for free and uninstall after usage).

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Ran the test again, same result. Ran validation, no errors.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Corrupted database is very interesting and unusual.
Header page of every database contains few fields with info about database structure (such as page size) which is necessary to know to work with database.
One of such fileds is page number of first pointer page of RDB$PAGES relation (struct header_page::hdr_PAGES).
It is always assigned value of 3 for Firebird databases in ODS 10 - 11.2
But in this database this field have value of 253.
This is very suspicious as all other fields looks OK.

Look like someone attempt to change database in incompatible way to allow only modified engine to work with it.

I don't believe that corruption could change *one* byte of header page !
Also, note : 253 = 256 - 3

PS After i changed this field back to the value of 3 i can validate database and found no errors.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

?????

It look like a "buffer overrun".

You have the database (A) BEFORE and (B) AFTER corruption.
And you have the VM to pass from A to B.
I use the lastest engine 2.5.1.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

I told you what i see. If you don't believe me - i see no reason to continue.

I have no Virtual PC and i'm not going to destroy my working machine just because you want it.
I'll try to import your VM image into VMWare Workstation which i have installed.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

Sorry, but you saw only a wrong value in a file.
I use a regular version of the engine, and only the engine/OS change the value. The 2 versions of the files was separated only by the "execute procedure testproc" statement. Nothing else.

If you can't import my VM into VMWare workstation, i will build a new virtual machine using VMWare.

@firebird-automations
Copy link
Collaborator Author

Commented by: alainsegura (linlin13)

I ran the test again (on an other computer) and the error is not on the same page.

the file will be available for download (30minutes) if someone interested...

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Imported VM image into VMWare and reproduced corruption.
(It is not a fun to work with French Windows, btw).
Will investigate it tomorrow.

@firebird-automations
Copy link
Collaborator Author

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants