Brand new Y2025 bug ;-)

So far, “integer” has always been sufficient for me to handle my data in terms of file sizes, character counts, etc.

Of course, Delphi uses “integer” extensively for all such operations internally too, e.g. TFile’s ReadAllLines method, or TStringList’s LoadFromFile, etc.

But now, in the age of AI, I’m finding that my training datasets are going into the multi-TB range, and I’m sure we will all get there soon enough.

Basically, the Delphi RTL must be updated to use some bigger data type. ASAP. Or things will begin to break rapidly.

Windows APIs have actually been using bigger numbers for some time now. And some Delphi RTL routines also use things like NativeUInt, or split a value into two integers, but not often.

Has anybody heard of anyone at Emb/Idera raising this issue?

I’ll bite: who’s using lists that have more than 2,147,483,647 entries in them?

Defaults in C#/.NET are all integers, and I have very rarely needed to use longs. The cost of storage and processing to use TBs of data just isn’t worth it in my opinion. Not saying don’t, but data and processing limits will be more limiting than 32-bit integers in the main.

Maybe not for everyone, or not today, or not all the time, but I already have files I cannot load with standard RTL, so it’s already happening…

Alex

That would likely be 16 GB of memory, minimum: a 64-bit pointer for each item - not including the actual list item data! A TStringList would start at 32 GB and would most likely be much larger than that!

Not sure if the memory allocator would even allow that amount of memory to be allocated - it appears to use a single contiguous block of memory to hold the complete list of pointers to the items!
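The back-of-the-envelope numbers check out - a quick sketch (the 16-bytes-per-TStringList-entry figure is an assumption: one string pointer plus one object pointer per item, before any string data):

```python
# Overflowing a signed 32-bit index means at least 2^31 items.
items = 2**31

# A plain pointer list: one 64-bit (8-byte) pointer per item.
pointer_list_gib = items * 8 // 2**30
print(pointer_list_gib)   # 16

# A TStringList entry is assumed here to be 16 bytes
# (string pointer + object pointer), before any string data.
stringlist_gib = items * 16 // 2**30
print(stringlist_gib)     # 32
```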

Loading with which RTL classes, exactly? Can you provide a reproducible example? Obviously do not include huge files, but it might help to know what size they are.

  • Can you provide a reproducible example?

Absolutely, I did provide 2 examples:

TFile’s ReadAllLines method, or

TStringList’s LoadFromFile

So, the solution for me was to readln/writeln a Text file instead.
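That workaround amounts to streaming the file one line at a time instead of materialising it all at once. A minimal sketch of the idea in Python (the tiny demo file here is a stand-in for a multi-GB one):

```python
import os
import tempfile

def count_lines(path):
    # The file object yields one line at a time, so memory use stays
    # flat no matter how big the file is, and no single length value
    # ever has to describe the whole file.
    n = 0
    with open(path, "r", encoding="utf-8") as f:
        for _ in f:
            n += 1
    return n

# Tiny demo file standing in for a huge one.
with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("line 1\nline 2\nline 3\n")
    demo_path = tmp.name

lines = count_lines(demo_path)
print(lines)  # 3
os.remove(demo_path)
```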

Alex

They’re examples of methods, not examples of how to reproduce the problem. Those methods work fine for me. Please give a reproducible example of how they do not work for you. As described earlier, please also include the file sizes (i.e. the sizes, not the actual files).

Just try them with a 5GB text file :wink: – in a typical fashion, no tricks, it’s just the size that matters.

And what’s 5 (or 50 for that matter) GB these days? Surely that would soon become 5-50TB, the way things are going. I have a 90TB drive now, but it’s filling rather fast…

Alex

It looks like the issue is that methods like ReadAllLines (which ultimately calls DoReadAllBytes) are attempting to read everything at once from the system, and at least Windows and macOS are saying “NO”. They could probably be changed so that the data is read in chunks, instead.

The ultimate error at the end was that it saw the buffer size as a negative value after the Integer overflowed. Or, with overflow checks enabled, it just crashed with an overflow error.
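The wraparound itself is easy to show - a small Python sketch of what a 64-bit size looks like once it is squeezed into a signed 32-bit Integer (the file sizes are just examples):

```python
def as_int32(size64):
    # Keep the low 32 bits, then reinterpret them as a signed value -
    # i.e. what lands in a 32-bit Integer with range checks disabled.
    v = size64 & 0xFFFFFFFF
    return v - 2**32 if v >= 2**31 else v

print(as_int32(3 * 2**30))  # -1073741824: a 3 GiB size turns negative
print(as_int32(5 * 2**30))  # 1073741824: a 5 GiB size silently becomes 1 GiB
```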

Alex

Out of curiosity, I “patched” DoReadAllBytes (called by ReadAllLines) so that it reads the data in chunks (of 1MB, or smaller when it reaches the end). Now the problem is that TEncoding.GetString fails with a range check error, since it only accepts Integer params - so for ReadAllLines, there’s at least 2 fixes needed.
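For illustration, the chunked-read pattern itself - a minimal Python sketch of the approach, not the actual DoReadAllBytes code; the 1MB chunk size matches the patch described above:

```python
import io

CHUNK = 1024 * 1024  # 1MB, with a shorter final read at end of stream

def read_all_bytes(stream):
    # Read to the end in fixed-size chunks, so no single read request
    # needs a length bigger than CHUNK.
    buf = bytearray()
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        buf.extend(chunk)
    return bytes(buf)

data = read_all_bytes(io.BytesIO(b"x" * (3 * CHUNK + 123)))
print(len(data))  # 3145851
```

Note the result still ends up in one contiguous buffer, and the decode step downstream still takes 32-bit lengths - which is exactly why the TEncoding.GetString fix is needed as well.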

For TStrings.LoadFromFile, it ultimately calls TStrings.LoadFromStream which right from the outset uses an Integer type for the size, but will suffer the same fate as ReadAllLines anyway, since it also uses TEncoding.GetString.

I wouldn’t expect the ‘solution’ to be quite that simple.

I’m curious as to when the code that must have been in the CP/M version of Turbo Pascal to handle these sorts of issues got removed, as seems to have happened in every product.

Once upon a time, handling the ‘file bigger than memory’ issue was a given, yet these days I always have difficulty finding an editor that can handle the huge ‘log file’ issue.

Kinda weird that the authors didn’t even think they should break the log file ‘every day’, or some other reasonable period.

This is going to sound like a shameless plug* but it’s not.

UltraEdit https://www.ultraedit.com can handle MASSIVE files and it responds to searches in those files pretty much instantaneously. I use it for dealing with obscenely large SQL text dump files of 2 GB or more.

Note that is obscenely large not obscene files - you’ll need a whole different editor for those… :zany_face:

*[UltraEdit is also owned by Idera and I sometimes do DevRel stuff for them too]


I expect it doesn’t use the technique of loading an entire file at once into an array of strings. Apparently there’s at least one case where that is deemed necessary :man_shrugging:


It definitely doesn’t do that! lol

Whatever they do, it’s fast. I mentioned it in connection with the comment about reading big log files.

  • there’s at least 2 fixes needed

I’m sure there are many more: it’s widespread - they’re using Integers everywhere. Much like 40 years ago they may have used Words for it, working with kBs.

Alex

We have an app that parses 5GB+ CSV files from a third party data provider. It uses TBufferedFileStream for the file access, which uses Int64 for seeking/size. We are loading into our own data structure. I have considered changing our CSV parser to use mmap() but that’s an exercise for another day. I shudder to think about using TStringList for super large files!
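The same pattern sketched in Python: parse row by row from a stream, so memory is bounded by the longest row rather than the file size (the column layout here is invented for illustration):

```python
import csv
import io

def sum_third_column(f):
    # csv.reader pulls one row at a time from the underlying stream,
    # so a 5GB+ file is handled without ever being loaded whole.
    total = 0.0
    for row in csv.reader(f):
        total += float(row[2])
    return total

sample = io.StringIO("a,b,1.5\nc,d,2.5\n")
print(sum_third_column(sample))  # 4.0
```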

For tools there is also klogg (GitHub - variar/klogg: Really fast log explorer based on glogg project). Blazingly fast at loading, searching, filtering and highlighting with huge files. It’s so fast that I’m convinced that it is either lying or magic :smiley:


LOL

FWIW - using TStringList to load large files is a Really Bad Idea IMO too.
