Issue Details

Key: CORE-3948
Type: New Feature
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Jörg Fröber
Votes: 6
Watchers: 9
Firebird Core

Optimize disk space consumption for BLOB concatenation operations

Created: 08/Oct/12 10:36 AM   Updated: 27/Apr/20 11:36 AM
Component/s: Engine
Affects Version/s: 2.5.1
Fix Version/s: None

Environment: Windows 7 Professional 64bit
Description
When a blob is concatenated multiple times, the disk space used by the database file grows massively. The only way to get the file size back to normal is a backup and restore of the database.

The issue can be reproduced as described below:

- Create an empty database
- Create the following procedure:
SET TERM ^ ;
CREATE OR ALTER PROCEDURE P_TEST (
  TEST VARCHAR(200),
  ITERATIONS INTEGER)
RETURNS (
  BLOBTEXT BLOB SUB_TYPE 0 SEGMENT SIZE 80)
AS
DECLARE COUNTER INTEGER;
BEGIN
  COUNTER = 0;
  BLOBTEXT = '';

  WHILE (COUNTER < ITERATIONS)
  DO BEGIN
    BLOBTEXT = BLOBTEXT || ' ' || TEST;
    COUNTER = COUNTER + 1;
  END

  SUSPEND;
END^
SET TERM ; ^

GRANT EXECUTE ON PROCEDURE P_TEST TO "PUBLIC";
GRANT EXECUTE ON PROCEDURE P_TEST TO SYSDBA;

- Run the procedure:
SELECT BLOBTEXT FROM P_TEST('This is a test sentence, which is going to be concatenated.', 4000);

After the procedure has been executed, the disk space used by the database file increases to over 1 GB. It makes no difference whether the transaction is committed or not.
When you run the procedure another time using the SQL command EXECUTE PROCEDURE P_TEST('This is a test sentence, which is going to be concatenated.', 4000); the file size increases again.

Comments
Adriano dos Santos Fernandes added a comment - 08/Oct/12 04:00 PM
If blob concatenation is moved to temporary space, then the cost of a concatenation followed by its write to the database will double.

After commit/rollback, this wasted database space will be reused.

Sean Leyne added a comment - 08/Oct/12 06:57 PM
Based on Adriano's comments, it seems the current functionality is "as designed/won't fix".

It is a temporary condition; the unneeded space/pages will be released to the engine for reuse for other purposes.

So, this case should be closed.

Objections?

Adriano dos Santos Fernandes added a comment - 08/Oct/12 07:08 PM
Well, there is room for improvement IMO. But this involves a kind of garbage collector for blobs in PSQL. It's far from easy!

Sean Leyne added a comment - 08/Oct/12 07:32 PM
I have edited the case description to better reflect the nature of the case and the changes required, as well as to turn the case into a feature request.

André Schwabe added a comment - 09/Oct/12 07:15 AM
In the case where we noticed this, a customer database went from 2.2 GB to 128 GB until it crashed because there was no more disk space. Even adding 500 characters to a blob 5000 times increases the database size by 2 GB permanently. Surely there is room for improvement.

Mark Rotteveel added a comment - 09/Oct/12 08:33 AM
How can concatenating 4000 * 59 characters = 236,000 characters (best case 236 KB, worst case 944 KB) result in a 1 GB increase? More importantly, how can that kind of excessive allocation not be considered a bug?

Hilmar Brodner added a comment - 09/Oct/12 01:16 PM
I agree with Mark that this surely cannot be a feature request. What feature would be requested here? "Add characters to a blob without using 1000 times the necessary space"? Sorry, Sean, marking this not as a bug but as a feature request would need some more explanation. As I understand Adriano, this is also not "as designed", so we should see about "won't fix".

I understand from Adriano that evidently all of the blob operations are done not in memory but on disk. One cannot do that in memory because of the huge size a blob could have. So the optimal solution IMHO would be to do it on disk like the temp files that are created when the sort buffer runs out of space. As I take it, changing that is a rather large chunk of work, so we are going to have to live a while longer with on-disk operations.
But what about the point that Mark brought up? Doing the maths, it seems that the blob takes up more and more space during the operation to arrive at 1 GB. Let's say that we are just concatenating the letter "a". What seems to be written to disk is not "a" -> "aa" -> "aaa" -> "4000 a" reusing the same pages, but rather a separate, new page (or several, depending on size) each time something is added: the first page blocked with "a", the next/second page blocked with "aa", and so on. I haven't found any documents about how the on-disk handling of blobs is done, so this is just a guess; please correct me or show me a link to a document. Should I be right, however, maybe it would be less work to concatenate like a varchar does, instead of writing the whole thing anew with each operation.
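This guess can be checked with simple arithmetic. Assuming each of the 4000 concatenations writes the entire new blob to fresh pages (a hypothesis about the engine's behaviour, not a documented fact), the data written grows quadratically with the iteration count:

```python
# Model assumption (not documented engine behaviour): iteration i rewrites the
# whole accumulated blob, i.e. roughly i * 60 bytes (' ' plus the 59-character
# test sentence from the original report).
APPEND = 60
ITERATIONS = 4000

total_written = sum(APPEND * i for i in range(1, ITERATIONS + 1))
print(total_written)           # 480120000 bytes of raw data
print(total_written // 2**20)  # 457 MiB
```

Roughly half a gigabyte of raw data before page granularity and segment overhead, so growth past 1 GB is plausible, even though the final blob itself is only 240,000 bytes.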

Adriano dos Santos Fernandes added a comment - 09/Oct/12 03:52 PM
I'll recheck it, as I implemented a solution for CORE-1658, which is a related problem.

Vlad Khorsun added a comment - 09/Oct/12 05:49 PM
> Even adding 500 characters to a blob 5000 times increases the database size by 2 GB permanently
You created 5000 blobs. And this is really "as designed".
And the increase in size shouldn't be permanent - after the transaction ends, all 5000 - 1 blobs will be released and the space will be available again.

The real solution, as Adriano said, includes a kind of garbage collector for blobs; it probably also includes reference counting for temporary blobs and the ability to append to the same instance of a temporary blob without creating a new one.

PS Hilmar, your guesses are correct.
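The "append to the same instance" idea can be sketched outside the engine. The following Python toy (a hypothetical design, nothing Firebird actually implements) shows how reference-counted temporary blobs would turn the quadratic rewrite into linear appends:

```python
# Hypothetical design sketch: a temporary blob with a reference count.
# When the variable holds the only reference, `b = b || x` can append in
# place instead of copying the whole blob into a new one.
class TempBlob:
    def __init__(self, data=b''):
        self.data = bytearray(data)
        self.refs = 1

def concat(blob, suffix):
    if blob.refs == 1:        # sole owner: append in place, O(len(suffix))
        blob.data += suffix
        return blob
    blob.refs -= 1            # shared: copy-on-write, full copy needed
    return TempBlob(bytes(blob.data) + suffix)

b = TempBlob()
for _ in range(4000):
    b = concat(b, b' ' + b'x' * 59)   # mirrors BLOBTEXT = BLOBTEXT || ' ' || TEST
print(len(b.data))   # 240000: linear work and space, no quadratic blow-up
```

With every iteration taking the sole-owner branch, total work is proportional to the final blob size rather than to its square.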

Hilmar Brodner added a comment - 10/Oct/12 07:59 AM - edited
We played around with this a bit more. In order to get around this problem, we thought we might actually write the concatenation to disk. So instead of doing this in "memory", we are doing updates on a table (terribly slow, but at least without taking 120 GB of disk space). We do, however, need to return more than one record, so we have the following constellation:

CREATE TABLE BBLOB (
    ID D_ID,
    BLOBBY D_BLOBTEXT
);

SET TERM ^ ;
CREATE OR ALTER PROCEDURE P_TEST2
RETURNS (
  BLOB_TEXT BLOB SUB_TYPE 1 SEGMENT SIZE 80)
AS
DECLARE COUNTER SMALLINT;
BEGIN
  DELETE FROM BBLOB;

  INSERT INTO BBLOB( ID, BLOBBY) VALUES ( 1,'');

  COUNTER = 3;
  WHILE (COUNTER > 0)
  DO BEGIN
    UPDATE BBLOB SET BLOBBY = BLOBBY || 'a' WHERE ID = 1;

    SELECT BLOBBY FROM BBLOB WHERE ID = 1 INTO :BLOB_TEXT;

    SUSPEND;
    COUNTER = COUNTER - 1;
  END
END^
SET TERM ; ^

GRANT SELECT,INSERT,DELETE,UPDATE ON BBLOB TO PROCEDURE P_TEST2;
GRANT EXECUTE ON PROCEDURE P_TEST2 TO SYSDBA;

Now if you do a "SELECT * FROM P_TEST2", you get a "BLOB not found" error message. Why? Well, seemingly the blob is not in memory if you do a SELECT (recall: it sort of works if we do this in "memory").

So we went a bit further and added a single line that shouldn't do anything:

    BLOB_TEXT = BLOB_TEXT || '';

ending up with

SET TERM ^ ;
CREATE OR ALTER PROCEDURE P_TEST2
RETURNS (
  BLOB_TEXT BLOB SUB_TYPE 1 SEGMENT SIZE 80)
AS
DECLARE COUNTER SMALLINT;
BEGIN
  DELETE FROM BBLOB;

  INSERT INTO BBLOB( ID, BLOBBY) VALUES ( 1,'');

  COUNTER = 3;
  WHILE (COUNTER > 0)
  DO BEGIN
    UPDATE BBLOB SET BLOBBY = BLOBBY || 'a' WHERE ID = 1;

    SELECT BLOBBY FROM BBLOB WHERE ID = 1 INTO :BLOB_TEXT;

    BLOB_TEXT = BLOB_TEXT || '';

    SUSPEND;
    COUNTER = COUNTER - 1;
  END
END^
SET TERM ; ^

GRANT SELECT,INSERT,DELETE,UPDATE ON BBLOB TO PROCEDURE P_TEST2;
GRANT EXECUTE ON PROCEDURE P_TEST2 TO SYSDBA;

If we do our "SELECT * FROM P_TEST2" again, everything works fine.
So evidently doing a SELECT on a blob and suspending it doesn't put it into memory, leading to problems, but accessing the same blob (modifying it without really doing anything) leads to it being put into memory, and everything works like a charm?

I'm unsure if I should open a new thread with this as it might be a related issue. This should qualify as a bug however, shouldn't it?

Vlad Khorsun added a comment - 11/Oct/12 08:03 PM
> Now if you do a "SELECT * FROM P_TEST2" you get a "BLOB not found" error message. Why?
Because SUSPEND returns the blob_id of the blob to you, but the UPDATE on the next iteration removes this blob and replaces it with a new one.

This is definitely a support question.

PS Blobs are not varchars; they were not designed for manipulation on the server side.

Hilmar Brodner added a comment - 02/Nov/12 08:58 AM
Sorry, but did you read my post? The problem is exactly the other way around: if you do server-side manipulation it works; if you do not, you get an error. The next iteration removes it only if you do NOT do this: BLOB_TEXT = BLOB_TEXT || ''; If you do this server-side manipulation, it does work...

The "Why?" was a rhetorical question, answered in the same line.

The problem is the other way around from your answer, which thus points to it being a bug and not a support issue.

"...not designed for manipulation at server side"

Well, http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/Firebird-2.5-LangRef-Update.pdf says that this has changed and all the operations we do above should work. Page 20:

"
Full support for:
- = (assignment);
- =, <>, <, <=, >, >= and synonyms (comparison);
- || (concatenation);
- BETWEEN, IS [NOT] DISTINCT FROM, IN, ANY|SOME and ALL.
"

Vlad Khorsun added a comment - 02/Nov/12 09:29 AM - edited
Hilmar,

this is a *support* question, and - yes, I've read your post.

Ok, one last try to explain how things work internally.

- the engine *can't* change blobs; it can only *create a new* blob - this is as designed, from many years ago
- every time you think you change a blob, the engine creates a new temporary blob (not assigned to any relation)
- if a temporary blob is assigned to some relation (using INSERT or UPDATE), it becomes a permanent blob
- at transaction end, all temporary blobs are released together with all the space they occupied
- when a blob field is UPDATE'd, the engine puts the blob_id of the new blob into the blob field, and the old blob should be removed from disk
- when the same transaction UPDATE's the blob in the same record a few times, the garbage collection is not deferred but immediate

Back to your procedure:
- at INSERT, the engine creates a permanent blob with blob_id, say, P1

- at the 1st UPDATE, the engine creates a new permanent blob P2 and places P1 into the undo log
- SELECT returns P2
- SUSPEND returns P2 to the client

- at the 2nd UPDATE, the engine creates a new permanent blob P3 and immediately removes P2, as it is impossible to undo to the point where P2 exists
- SELECT returns P3
- SUSPEND returns P3 to the client
....

when the client application opens P2, it receives the "blob not found" error, obviously.



When you add BLOB_TEXT = BLOB_TEXT || ''; before the SUSPEND, you completely change the game - look:

- at INSERT, the engine creates a permanent blob with blob_id, say, P1

- at the 1st UPDATE, the engine creates a new permanent blob P2 and places P1 into the undo log
- SELECT returns P2
- BLOB_TEXT = BLOB_TEXT || '' creates a new temporary blob T1
- SUSPEND returns T1 to the client

- at the 2nd UPDATE, the engine creates a new permanent blob P3 and immediately removes P2, as it is impossible to undo to the point where P2 exists
- SELECT returns P3
- BLOB_TEXT = BLOB_TEXT || '' creates a new temporary blob T2
- SUSPEND returns T2 to the client
...
when the application opens T1, it succeeds, as temporary blobs are retained until the transaction ends.


I hope it is clearer now.
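A small Python model of this lifecycle (purely illustrative - the dictionaries and helper names are invented, nothing here is engine code) reproduces the observable difference between the two procedure variants:

```python
permanent = {}   # blob_id -> contents; a same-record UPDATE removes the old one
temporary = {}   # blob_id -> contents; retained until the transaction ends
next_id = [0]

def new_blob(store, data):
    next_id[0] += 1
    store[next_id[0]] = data
    return next_id[0]

def run(extra_assignment):
    permanent.clear(); temporary.clear()
    suspended = []                     # blob ids handed to the client
    field = new_blob(permanent, '')    # INSERT creates P1
    for _ in range(3):
        # UPDATE: create the new permanent blob, drop the replaced one
        old, field = field, new_blob(permanent, permanent[field] + 'a')
        permanent.pop(old, None)       # same-transaction update: immediate GC
        out = field                    # SELECT BLOBBY ... INTO :BLOB_TEXT
        if extra_assignment:           # BLOB_TEXT = BLOB_TEXT || ''
            out = new_blob(temporary, permanent[out])
        suspended.append(out)          # SUSPEND hands the id to the client
    # the client now tries to open each suspended blob id:
    return [sid in permanent or sid in temporary for sid in suspended]

print(run(False))  # [False, False, True]: only the last row's blob still exists
print(run(True))   # [True, True, True]: temporaries survive until transaction end
```

Without the extra assignment, every row except the last points at a blob that a later UPDATE already removed, hence "blob not found".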

Hilmar Brodner added a comment - 06/Nov/12 07:42 AM
Vlad,

I understand how Firebird works in this respect perfectly well. My problem is with the way it works:

1. The Firebird documentation says blobs now support a lot of the operations that varchars can do. When you use them, you get strange results and disk-space leaks (having a database go from 2 GB to 140 GB while working with blobs and then crashing because it has run out of space is NOT okay, and it is NOT a problem that can be solved by support, thus not for the support forum).

2. The inability to do a sort of garbage collection on blobs, so the database stays huge until a restore is done: this will not be fixed by support either. I also fail to see how you can argue that this is "as designed" and works perfectly well.

3. I understand why things work like they do when I use a concatenation. Nonetheless, it is wrong. It is like having to use "iInteger = iInteger + 0;" in order to be able to suspend it. (Yes, I know it is not the same and you cannot compare it 100%, but I think you see my point.)


So if you can tell me how to join a few thousand lines into a blob without crashing a live production system database and without increasing the database by a factor of 100 or more, I will gladly accept this as a support question. Otherwise, I would appreciate this ticket not being closed because it's "as designed" or a "support issue". And maybe someone should think over saying in the release notes that concatenation of blobs works like a charm.

Vlad Khorsun added a comment - 06/Nov/12 08:28 AM
Hilmar,

> I understand how Firebird works in this respect perfectly well.
I don't see it from your comments.

...

> So if you can tell me how to join a few thousand lines to a Blob
Sorry, I can't tell you how to do terribly wrong things.

PS This is not a productive discussion, and I see no point in making arguments that are not taken up by the other side.
Moreover, the place for discussions is the support/devel list, not here.

PPS There is room for improvement in blob handling on the engine side, but it will never support crazy usage.
So far, massive changing of blobs in PSQL is not practical, even though it is supported by the engine itself.

André Luís de Oliveira Pereira added a comment - 15/Mar/13 06:17 PM
We have more than 200 clients using Firebird. At the moment, everyone using the Accounting, Revenue, HR and Sanitation systems is on version 2.1 or 2.5, and they are having disk-space problems when running procedures that manipulate BLOB fields, especially when concatenating them. There were several cases at the start of this year when calculating IPTU in the Revenue system, in which a 60 GB database jumped to 135 GB, stopping Firebird and causing an enormous problem with one of our TOP 10 clients. I would like to know when this problem will be resolved. We are trying to work around it by creating VARCHAR fields with a size of 32000, but they turned out to be very slow. Is there another solution? Why not use temporary space to perform these manipulations with procedures and BLOB fields and discard it afterwards? It could create large cache files, but they would be eliminated afterwards.

Sean Leyne added a comment - 15/Mar/13 07:09 PM
Edited the description for readability

Vlad Khorsun added a comment - 16/Mar/13 08:30 AM
Andre, use English, please.

Agris added a comment - 05/Sep/13 12:05 PM
Today I encountered this issue while writing a procedure that had to generate and return a CSV file with approx. 5000 rows. As you are guessing, I made a FOR SELECT loop and concatenated the selected field values into a blob variable. Example code:
FOR SELECT a.f1, a.f2
  FROM t1 as a
  INTO :f1, :f2
DO BEGIN
  blob_fld = blob_fld || f1 || ',' || f2 || ascii_char(13);
END

On small data amounts (~100 rows) it was working fast - less than 100 ms. But running in the production database, on 5000 rows, I got an execution time of almost 4 minutes. The query itself was executed in less than 100 ms. I started googling and found this thread.
I wrote a workaround for this problem. Not a real solution, but working code for my use case. Maybe it helps someone. Instead of looping through the whole result set and concatenating row by row into the blob, I wrote a select using the LIST aggregate function. It solved the performance problem, and I got my CSV file with 5000+ rows in less than 500 ms.
Resulting query:
SELECT LIST(f1 || ',' || f2, ascii_char(13))
FROM (
  SELECT a.f1, a.f2
    FROM t1 as a
  ) as x
INTO :blob_fld;

Note that the LIST documentation specifies that "The ordering of the list values is undefined". In my case, I didn't encounter any crazy data ordering.
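The speedup has the same root cause as the quadratic growth discussed earlier in the thread: the loop rewrites the accumulated blob on every iteration, while LIST builds it once. A rough Python analogy (the table contents are made up; this is not Firebird code):

```python
rows = [('a%d' % i, 'b%d' % i) for i in range(5000)]

# Row-by-row concatenation: in a naive engine each step copies the whole
# accumulated value, so total work is quadratic in the number of rows.
def loop_concat():
    s = ''
    for f1, f2 in rows:
        s = s + f1 + ',' + f2 + '\r'   # mirrors blob_fld = blob_fld || ...
    return s

# Single aggregation, the analogue of LIST(): one pass, one final value.
def aggregate():
    return '\r'.join(f1 + ',' + f2 for f1, f2 in rows) + '\r'

assert loop_concat() == aggregate()    # same result, very different cost model
```

(Firebird's LIST puts the delimiter between items only, so the SQL version lacks the trailing delimiter that the loop produces; the Python mirrors the loop on both sides to keep the comparison fair.)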

Karol Bieniaszewski added a comment - 27/Apr/20 07:15 AM - edited
I see a simple solution for this.
For all variables declared inside a procedure, do not change the blob id and do not allocate a new one.
Simply overwrite it.
Only when assigning to a returned blob do you need things like new allocations.
Then the code as provided should be changed to (introduce a variable and operate on it):

CREATE OR ALTER PROCEDURE P_TEST (
  TEST VARCHAR(200),
  ITERATIONS INTEGER)
RETURNS (
  BLOBTEXT BLOB SUB_TYPE 0 SEGMENT SIZE 80)
AS
DECLARE COUNTER INTEGER;
DECLARE VARIABLE VAR_TMP BLOB SUB_TYPE 0 SEGMENT SIZE 80;
BEGIN
  COUNTER = 0;
  VAR_TMP = '';

  WHILE (COUNTER < ITERATIONS)
  DO BEGIN
    VAR_TMP = VAR_TMP || ' ' || TEST;
    COUNTER = COUNTER + 1;
  END

  BLOBTEXT = VAR_TMP;

  SUSPEND;
END^
SET TERM ; ^

I suppose this could be simply implemented by the FB team,

or maybe even simpler - only SUSPEND should produce a new blob id and allocate it; all other operations inside a stored procedure should work without creating new temporary blobs over and over.

Vlad Khorsun added a comment - 27/Apr/20 11:05 AM
Karol,

no, it is not as simple as we all wish, unfortunately. Otherwise it would have been implemented years ago.

So far, I can only repeat myself:

> The real solution, as Adriano said, includes a kind of garbage collector for blobs; it probably also includes reference counting for temporary blobs
> and the ability to append to the same instance of a temporary blob without creating a new one.

BTW, the ticket type should be "Improvement", not "New Feature", IMO.

Adriano dos Santos Fernandes added a comment - 27/Apr/20 11:36 AM
I also have a half-developed RDB$BLOB_UTIL system package which solves this problem in PSQL.

However, it probably will not be in v4.0.